Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
14 works
0 followers
14 topics
4 close collaborators

Published work

14 published item(s)

preprint · 2026 · arXiv

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.

preprint · 2023 · arXiv

Distributed local spline simulator for wave propagation

Numerical simulation of wave propagation in elastic media faces challenges arising from the increasing demand for high resolution in modern 3-D imaging applications, which requires a balance between efficiency and accuracy in addition to being friendly to the distributed high-performance computing environment. In this paper, we propose a distributed local spline simulator (LOSS) for solving the wave equation. LOSS uses patched cubic B-splines to represent the wavefields and attains an accurate evaluation of spatial derivatives with linear complexity. In order to link the adjacent patches, a perfectly matched boundary condition is introduced to give a closure of local spline coefficients. Owing to the rapid decay property of the local wavelets in dual space, it can recover the global spline as accurately as possible only at the cost of local communications among adjacent neighbors. Several typical numerical examples, including the 2-D acoustic wave equation and P- and S-wave propagation in 3-D homogeneous or heterogeneous media, are provided to validate its convergence, accuracy and parallel scalability.

preprint · 2022 · arXiv

A Generic Algorithm for Top-K On-Shelf Utility Mining

On-shelf utility mining (OSUM) is an emerging research direction in data mining. It aims to discover itemsets that have high relative utility in their selling time period. Compared with traditional utility mining, OSUM can find more practical and meaningful patterns in real-life applications. However, there is a major drawback to traditional OSUM. For normal users, it is hard to define a minimum threshold minutil for mining the right amount of on-shelf high utility itemsets. On one hand, if the threshold is set too high, the number of patterns would not be enough. On the other hand, if the threshold is set too low, too many patterns will be discovered and cause an unnecessary waste of time and memory consumption. To address this issue, the user usually directly specifies a parameter k, where only the top-k high relative utility itemsets would be considered. Therefore, in this paper, we propose a generic algorithm named TOIT for mining Top-k On-shelf hIgh-utility paTterns to solve this problem. TOIT applies a novel strategy to raise the minutil based on the on-shelf datasets. Besides, two novel upper-bound strategies named subtree utility and local utility are applied to prune the search space. By adopting the strategies mentioned above, the TOIT algorithm can narrow the search space as early as possible, improve the mining efficiency, and reduce the memory consumption, so it can obtain better performance than other algorithms. A series of experiments have been conducted on real datasets with different styles to compare the effects with the state-of-the-art KOSHU algorithm. The experimental results showed that TOIT outperforms KOSHU in both running time and memory consumption.
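The threshold-raising idea in this abstract can be illustrated with a generic top-k sketch. This is not the TOIT algorithm: it assumes candidate itemsets and their utilities are already available as a stream, and the function name `top_k_with_rising_threshold` and the sample data are hypothetical. It only shows how an internal `minutil` can start permissive and rise as better candidates are found, so the user supplies k instead of a threshold.

```python
import heapq

def top_k_with_rising_threshold(candidates, k):
    """Keep the k highest-utility itemsets; return them and the final threshold.

    candidates: iterable of (itemset, utility) pairs (hypothetical input format).
    """
    heap = []        # min-heap of (utility, itemset) for the best k seen so far
    minutil = 0.0    # internal threshold, raised once k candidates are held
    for itemset, utility in candidates:
        if utility < minutil:
            continue  # below the current threshold, cannot enter the top-k
        if len(heap) < k:
            heapq.heappush(heap, (utility, itemset))
        else:
            # push the new candidate and drop the weakest one in a single step
            heapq.heappushpop(heap, (utility, itemset))
        if len(heap) == k:
            minutil = heap[0][0]  # the k-th best utility becomes the threshold
    return sorted(heap, reverse=True), minutil

# Illustrative data only, not taken from the paper.
sample = [("a", 5.0), ("b", 12.0), ("ab", 9.0), ("c", 3.0), ("bc", 15.0)]
best, threshold = top_k_with_rising_threshold(sample, k=3)
```

In a real miner the rising threshold would also be used to prune the search space before utilities are fully computed, which is where upper-bound strategies such as those named in the abstract come in.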

preprint · 2022 · arXiv

An efficient and easy-to-extend Matlab code of the Moving Morphable Component (MMC) method for three-dimensional topology optimization

Explicit topology optimization methods have received ever-increasing interest in recent years. In particular, a 188-line Matlab code of the two-dimensional (2D) Moving Morphable Component (MMC)-based topology optimization method was released by Zhang et al. (Struct Multidiscip Optim 53(6):1243-1260, 2016). The present work aims to propose an efficient and easy-to-extend 256-line Matlab code of the MMC method for three-dimensional (3D) topology optimization implementing some new numerical techniques. To be specific, by virtue of the function aggregation technique, accurate sensitivity analysis, which is also easy to extend to other problems, is achieved. Besides, based on an efficient identification algorithm for the load transmission path, the degrees of freedom (DOFs) not belonging to the load transmission path are removed in finite element analysis (FEA), which significantly accelerates the optimization process. As a result, compared to the corresponding 188-line 2D code, the performance of the optimization results, the computational efficiency of FEA, and the convergence rate and robustness of the optimization process are greatly improved. For the sake of completeness, a refined 218-line Matlab code implementing the 2D-MMC method is also provided.

preprint · 2022 · arXiv

Federated Learning for Personalized Humor Recognition

Computational understanding of humor is an important topic under creative language understanding and modeling. It can play a key role in complex human-AI interactions. The challenge here is that human perception of humorous content is highly subjective. The same joke may receive different funniness ratings from different readers. This makes it highly challenging for humor recognition models to achieve personalization in practical scenarios. Existing approaches are generally designed based on the assumption that users have a consensus on whether a given text is humorous or not. Thus, they cannot handle diverse humor preferences well. In this paper, we propose the FedHumor approach for the recognition of humorous content in a personalized manner through Federated Learning (FL). Extending a pre-trained language model, FedHumor guides the fine-tuning process by considering diverse distributions of humor preferences from individuals. It incorporates a diversity adaptation strategy into the FL paradigm to train a personalized humor recognition model. To the best of our knowledge, FedHumor is the first text-based personalized humor recognition model through federated learning. Extensive experiments demonstrate the advantage of FedHumor in recognizing humorous texts compared to nine state-of-the-art humor recognition approaches with superior capability for handling the diversity in humor labels produced by users with diverse preferences.

preprint · 2022 · arXiv

Model-Free Statistical Inference on High-Dimensional Data

This paper aims to develop an effective model-free inference procedure for high-dimensional data. We first reformulate the hypothesis testing problem via a sufficient dimension reduction framework. With the aid of the new reformulation, we propose a new test statistic and show that its asymptotic distribution is a $\chi^2$ distribution whose degrees of freedom do not depend on the unknown population distribution. We further conduct power analysis under local alternative hypotheses. In addition, we study how to control the false discovery rate of the proposed $\chi^2$ tests, which are correlated, to identify important predictors under a model-free framework. To this end, we propose a multiple testing procedure and establish its theoretical guarantees. Monte Carlo simulation studies are conducted to assess the performance of the proposed tests and an empirical analysis of a real-world data set is used to illustrate the proposed methodology.

preprint · 2022 · arXiv

Topology optimization on complex surfaces based on the moving morphable component (MMC) method and computational conformal mapping (CCM)

In the present paper, an integrated paradigm for topology optimization on complex surfaces with arbitrary genus is proposed. The approach is constructed based on the two-dimensional (2D) Moving Morphable Component (MMC) framework, where a set of structural components is used as the basic units of optimization, and the computational conformal mapping (CCM) technique, with which a complex surface represented by an unstructured triangular mesh can be mapped numerically into a set of regular 2D parameter domains. A multi-patch stitching scheme is also developed to achieve an MMC-friendly global parameterization through a number of local parameterizations. Numerical examples including a saddle-shaped shell, a torus-shaped shell and a tee-branch pipe are solved to demonstrate the validity and efficiency of the proposed approach. It is found that, compared with traditional approaches for topology optimization on 2D surfaces, optimized designs with clear load transmission paths can be obtained with far fewer design variables and degrees of freedom for finite element analysis (FEA) via the proposed approach.

preprint · 2021 · arXiv

Optimisation of spatially varying orthotropic porous structures based on conformal mapping

In this article, a compliance minimisation scheme for designing spatially varying orthotropic porous structures is proposed. With the utilisation of conformal mapping, the porous structures here can be generated by two controlling field variables, the (logarithm of the) local scaling factor and the rotational angle of the matrix cell, and they are interrelated through the Cauchy-Riemann equations. Thus the design variables are simply reduced to the logarithm values of the local scaling factor on selected boundary points. Other attractive features shown by the present method are summarised as follows. Firstly, with the condition of total differential automatically met by the two controlling field variables, the integrability problem which necessitates post-processing treatments in many other similar methods can be resolved naturally. Secondly, according to the maximum principle for harmonic functions, the minimum feature size can be explicitly monitored during optimisation. Thirdly, the rotational symmetry possessed by the matrix cell can be fully exploited in the context of conformal mapping, and the computational cost for solving the cell problems for the homogenised elasticity tensor is reduced as far as possible. In particular, when the design domain takes a rectangular shape, analytical expressions for the controlling fields are available. The homogenised results are shown, both theoretically and numerically, to converge to the corresponding fine-scale results, and the effectiveness of the proposed work is further demonstrated with more numerical examples.
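The link between the two controlling fields can be written out as a short worked relation. The symbols below ($f$, $\varphi$, $\theta$, $\Omega$) are assumed notation for illustration, and the paper's own notation and sign convention for the rotation angle may differ.

```latex
% Assumed notation: f is a conformal map on the design domain \Omega,
% \varphi is the logarithm of the local scaling factor, \theta the rotation angle.
\[
  f'(z) = e^{\varphi(x,y) + i\,\theta(x,y)}, \qquad z = x + i y .
\]
% Where f' \neq 0, \log f' = \varphi + i\theta is holomorphic,
% so the Cauchy-Riemann equations couple the two fields:
\[
  \frac{\partial \varphi}{\partial x} = \frac{\partial \theta}{\partial y},
  \qquad
  \frac{\partial \varphi}{\partial y} = -\frac{\partial \theta}{\partial x},
\]
% and both fields are therefore harmonic:
\[
  \nabla^2 \varphi = 0, \qquad \nabla^2 \theta = 0 .
\]
% Maximum principle: the extreme values of \varphi are attained on the boundary,
\[
  \min_{\partial\Omega} \varphi \;\le\; \varphi(x,y) \;\le\; \max_{\partial\Omega} \varphi
  \quad \text{for all } (x,y) \in \Omega .
\]
```

Under these assumptions, prescribing $\varphi$ on the boundary fixes it throughout the domain and fixes $\theta$ up to a constant, which is consistent with the abstract's statement that the design variables reduce to boundary values and that the feature size can be monitored through the maximum principle.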

preprint · 2020 · arXiv

A New Procedure for Controlling False Discovery Rate in Large-Scale t-tests

This paper is concerned with false discovery rate (FDR) control in large-scale multiple testing problems. We first propose a new data-driven testing procedure for controlling the FDR in large-scale t-tests for the one-sample mean problem. The proposed procedure achieves exact FDR control in finite sample settings when the populations are symmetric, regardless of the number of tests or the sample sizes. Compared with the existing bootstrap method for FDR control, the proposed procedure is computationally efficient. We show that the proposed method can control the FDR asymptotically for asymmetric populations even when the test statistics are not independent. We further show that the proposed procedure with a simple correction is as accurate as the bootstrap method to the second-order degree, and could be much more effective than the existing normal calibration. We extend the proposed procedure to the two-sample mean problem. Empirical results show that the proposed procedures have better FDR control than existing ones when the proportion of true alternative hypotheses is not too low, while maintaining reasonably good detection ability.
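For readers unfamiliar with FDR control, the classical Benjamini-Hochberg step-up procedure is a useful point of reference. The sketch below implements that standard baseline only; it is not the data-driven procedure proposed in this paper, and the function name and sample p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Standard Benjamini-Hochberg step-up rule: find the largest rank r such that
    the r-th smallest p-value is <= alpha * r / m, then reject the r smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Illustrative p-values only. The third-smallest passes its threshold,
# so the two smaller ones are rejected as well (the step-up behaviour).
decisions = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], alpha=0.05)
```

The step-up rule rejects every hypothesis ranked at or below the cutoff, even if its own p-value missed its individual threshold.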

preprint · 2020 · arXiv

Deep Learning Inversion of Electrical Resistivity Data

The inverse problem of electrical resistivity surveys (ERSs) is difficult because of its nonlinear and ill-posed nature. For this task, traditional linear inversion methods still face challenges such as suboptimal approximation and initial model selection. Inspired by the remarkable nonlinear mapping ability of deep learning approaches, in this article, we propose to build the mapping from apparent resistivity data (input) to resistivity model (output) directly by convolutional neural networks (CNNs). However, the vertically varying characteristic of patterns in the apparent resistivity data may cause ambiguity when using CNNs with the weight sharing and effective receptive field properties. To address this potential issue, we supply an additional tier feature map to the CNNs to make them aware of the relationship between input and output. Based on the prevalent U-Net architecture, we design our network (ERSInvNet), which can be trained end-to-end and reaches a very fast inference speed during testing. We further introduce a depth weighting function and a smooth constraint into the loss function to improve inversion accuracy for the deep region and suppress false anomalies. Six groups of experiments are considered to demonstrate the feasibility and efficiency of the proposed methods. According to the comprehensive qualitative analysis and quantitative comparison, ERSInvNet with the tier feature map, smooth constraints, and depth weighting function together achieves the best performance.

preprint · 2020 · arXiv

Generation of smoothly-varying infill configurations from a continuous menu of cell patterns and the asymptotic analysis of its mechanical behaviour

We here introduce a novel scheme for generating smoothly-varying infill graded microstructural (IGM) configurations from a given menu of generating cells. The scheme was originally proposed for essentially improving the variety of describable configurations in a modified asymptotic homogenisation-based topology optimisation framework [1] for fast IGM design. But the proposed scheme, after modification, also demonstrates its unique value in two respects. First, it provides a fairly simple way of generating an IGM configuration continuously patching any given cell configurations. Second, it offers a straightforward means for decorating microstructures on a given manifold. We will further show that the form of topology description function given here effectively offers a platform for unifying most existing approaches for IGM generation. Fuelled by asymptotic analysis of the mechanical behaviour of the resulting IGM configurations, a topology optimisation scheme for compliance minimisation is introduced. We will finally show that the use of the present scheme helps reduce the compliance value of an optimised structure by nearly a half, compared with that from the original framework [1].

preprint · 2020 · arXiv

Surrogate representation of sink strengths and the long-term role of crystalline interfaces in the development of irradiation-induced bubbles

The present article addresses an early-stage attempt at replacing the analyticity-based sink strength terms in rate equations by surrogate models of machine learning representation. Here we emphasise, in the context of multiscale modelling, a combinative use of machine learning with scale analysis, through which a set of fine-resolution problems of partial differential equations describing the (quasi-steady) short-range individual sink behaviour can be asymptotically sorted out from the mean-field kinetics. Hence the training of machine learning is restrictively oriented, that is, to express the local and already identified, but analytically unavailable nonlinear functional relationships between the sink strengths and other local continuum field quantities. With the trained models, one can quantitatively investigate the biased effect shown by a void/bubble being a point defect sink, and the results are compared with existing ones over well-studied scenarios. Moreover, the faster diffusive mechanisms on crystalline interfaces are modelled separately by locally planar rate equations, and their linkages with rate equations for bulk diffusion are formulated through derivative jumps of point defect concentrations across the interfaces. Thus the distinctive role of crystalline interfaces as partial sinks and quick diffusive channels can be investigated. Methodologically, the present treatment is also applicable to studying more complicated situations of long-term sink behaviour observed in irradiated materials.

preprint · 2020 · arXiv

The Role of Propensity Score Structure in Asymptotic Efficiency of Estimated Conditional Quantile Treatment Effect

When a strict subset of covariates is given, we propose the conditional quantile treatment effect to capture the heterogeneity of treatment effects via the quantile sheet, which is a function of the given covariates and quantile. We focus on deriving the asymptotic normality of propensity score-based estimators under parametric, nonparametric and semiparametric structures. We make a systematic study of the estimation efficiency to check the importance of propensity score structure and the essential differences from the unconditional counterparts. The derived unique properties can answer: what is the general ranking of these estimators? how does the affiliation of the given covariates to the set of covariates of the propensity score affect the efficiency? how does the convergence rate of the estimated propensity score affect the efficiency? and why would semiparametric estimation be worth recommending in practice? We also give a brief discussion on the extension of the methods to handle large-dimensional scenarios and on the estimation of the asymptotic variances. Simulation studies are conducted to examine the performance of these estimators. A real data example is analyzed for illustration and some new findings are obtained.

preprint · 2019 · arXiv

Optimal design of shell-graded-infill structures by a hybrid MMC-MMV approach

In the present work, a hybrid MMC-MMV approach is developed for designing additive manufacturing-oriented shell-graded-infill structures. The key idea is to describe the geometry of a shell-graded-infill structure explicitly using some geometry parameters. To this end, a set of morphable voids is adopted to describe the boundary of the coating shell, while a set of morphable components combined with a coordinate perturbation technique is introduced to represent the graded infill distribution. Under such treatment, both the crisp boundary of the coating shell and the graded infill can be optimized simultaneously, with a small number of design variables. Numerical examples demonstrate the effectiveness of the proposed approach.