Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
15 works
0 followers
17 topics
4 close collaborators

Published work

15 published items

Preprint, 2026, arXiv

Covariance-Driven Regression Trees: Reducing Overfitting in CART

Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when grown deep or the sample size is small. Conventional methods to reduce overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This method is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates with true signals. We establish an oracle inequality of CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. We find that CovRT achieves superior prediction accuracy compared to CART in both simulations and real-world tasks.

Preprint, 2026, arXiv

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.

Preprint, 2024, arXiv

Estimating and Mitigating the Congestion Effect of Curbside Pick-ups and Drop-offs: A Causal Inference Approach

Curb space is one of the busiest areas in urban road networks. Especially in recent years, the rapid increase of ride-hailing trips and commercial deliveries has induced massive pick-ups/drop-offs (PUDOs), which occupy the limited curb space that was designed and built decades ago. These PUDOs could jam curbside utilization and disturb the mainline traffic flow, evidently leading to significant negative societal externalities. However, there is a lack of an analytical framework that rigorously quantifies and mitigates the congestion effect of PUDOs in the system view, particularly with little data support and involvement of confounding effects. To bridge this research gap, this paper develops a rigorous causal inference approach to estimate the congestion effect of PUDOs on general regional networks. A causal graph is set to represent the spatio-temporal relationship between PUDOs and traffic speed, and a double and separated machine learning (DSML) method is proposed to quantify how PUDOs affect traffic congestion. Additionally, a re-routing formulation is developed and solved to encourage passenger walking and traffic flow re-routing to achieve system optimization. Numerical experiments are conducted using real-world data in the Manhattan area. On average, 100 additional units of PUDOs in a region could reduce the traffic speed by 3.70 and 4.54 mph on weekdays and weekends, respectively. Re-routing trips with PUDOs on curb space could respectively reduce the system-wide total travel time by 2.44% and 2.12% in Midtown and Central Park on weekdays. Sensitivity analysis is also conducted to demonstrate the effectiveness and robustness of the proposed framework.

Preprint, 2023, arXiv

An Empirical Study on Noisy Label Learning for Program Understanding

Recently, deep learning models have been widely applied in program understanding tasks, and these models achieve state-of-the-art results on many benchmark datasets. A major challenge of deep learning for program understanding is that the effectiveness of these approaches depends on the quality of their datasets, and these datasets often contain noisy data samples. A typical kind of noise in program understanding datasets is label noise, which means that the target outputs for some inputs are incorrect. Researchers have proposed various approaches to alleviate the negative impact of noisy labels, and formed a new research topic: noisy label learning (NLL). In this paper, we conduct an empirical study on the effectiveness of noisy label learning on deep learning for program understanding datasets. We evaluate various NLL approaches and deep learning models on three tasks: program classification, vulnerability detection, and code summarization. From the evaluation results, we come to the following findings: 1) small trained-from-scratch models are prone to label noises in program understanding, while large pre-trained models are highly robust against them. 2) NLL approaches significantly improve the program classification accuracies for small models on noisy training sets, but they only slightly benefit large pre-trained models in classification accuracies. 3) NLL can effectively detect synthetic noises in program understanding, but struggle in detecting real-world noises. We believe our findings can provide insights on the abilities of NLL in program understanding, and shed light on future works in tackling noises in software engineering datasets. We have released our code at https://github.com/jacobwwh/noise_SE.

Preprint, 2022, arXiv

Characterizing and Understanding the Behavior of Quantized Models for Reliable Deployment

Deep Neural Networks (DNNs) have gained considerable attention in the past decades due to their astounding performance in different applications, such as natural language modeling, self-driving assistance, and source code understanding. With rapid exploration, more and more complex DNN architectures have been proposed along with huge pre-trained model parameters. The common way to use such DNN models in user-friendly devices (e.g., mobile phones) is to perform model compression before deployment. However, recent research has demonstrated that model compression, e.g., model quantization, yields accuracy degradation as well as output disagreements when tested on unseen data. Since the unseen data always include distribution shifts and often appear in the wild, the quality and reliability of quantized models are not ensured. In this paper, we conduct a comprehensive study to characterize and help users understand the behaviors of quantized models. Our study considers 4 datasets spanning from image to text, 8 DNN architectures including feed-forward neural networks and recurrent neural networks, and 42 shifted sets with both synthetic and natural distribution shifts. The results reveal that 1) Data with distribution shifts produce more disagreements than data without. 2) Quantization-aware training can produce more stable models than standard, adversarial, and Mixup training. 3) Disagreements often have closer top-1 and top-2 output probabilities, and $Margin$ is a better indicator than the other uncertainty metrics to distinguish disagreements. 4) Retraining with disagreements has limited efficiency in removing disagreements. We open-source our code and models as a new benchmark for further studying the quantized models.
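
The margin signal mentioned in finding 3 can be illustrated with a minimal sketch. It assumes margin means the gap between the top-1 and top-2 output probabilities, and that a disagreement means the original and quantized models predict different classes for the same input; the function names and the probability values are hypothetical and are not taken from the paper's released code.

```python
from typing import Sequence


def margin(probs: Sequence[float]) -> float:
    """Gap between the largest and second-largest class probability."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def predicted_class(probs: Sequence[float]) -> int:
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def is_disagreement(p_original: Sequence[float], p_quantized: Sequence[float]) -> bool:
    """True when the two models predict different classes for one input."""
    return predicted_class(p_original) != predicted_class(p_quantized)


if __name__ == "__main__":
    # Hypothetical softmax outputs for a single input (illustrative values only).
    p_original = [0.48, 0.45, 0.07]
    p_quantized = [0.44, 0.49, 0.07]
    print("disagreement:", is_disagreement(p_original, p_quantized))
    print("margin of original model:", margin(p_original))
```

In this toy input the original model's margin is small, so a slight perturbation from quantization is enough to flip the predicted class, which is the intuition behind using margin as an indicator.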

Preprint, 2022, arXiv

Domain Adversarial Spatial-Temporal Network: A Transferable Framework for Short-term Traffic Forecasting across Cities

Accurate real-time traffic forecast is critical for intelligent transportation systems (ITS) and it serves as the cornerstone of various smart mobility applications. Though this research area is dominated by deep learning, recent studies indicate that the accuracy improvement by developing new model structures is becoming marginal. Instead, we envision that the improvement can be achieved by transferring the "forecasting-related knowledge" across cities with different data distributions and network topologies. To this end, this paper aims to propose a novel transferable traffic forecasting framework: Domain Adversarial Spatial-Temporal Network (DASTNet). DASTNet is pre-trained on multiple source networks and fine-tuned with the target network's traffic data. Specifically, we leverage the graph representation learning and adversarial domain adaptation techniques to learn the domain-invariant node embeddings, which are further incorporated to model the temporal traffic data. To the best of our knowledge, we are the first to employ adversarial multi-domain adaptation for network-wide traffic forecasting problems. DASTNet consistently outperforms all state-of-the-art baseline methods on three benchmark datasets. The trained DASTNet is applied to Hong Kong's new traffic detectors, and accurate traffic predictions can be delivered immediately (within one day) when the detector is available. Overall, this study suggests an alternative to enhance the traffic forecasting methods and provides practical implications for cities lacking historical traffic data.

Preprint, 2022, arXiv

Estimating probabilistic dynamic origin-destination demands using multi-day traffic data on computational graphs

System-level decision making in transportation needs to understand day-to-day variation of network flows, which calls for accurate modeling and estimation of probabilistic dynamic travel demand on networks. Most existing studies estimate deterministic dynamic origin-destination (OD) demand, while the day-to-day variation of demand and flow is overlooked. Estimating probabilistic distributions of dynamic OD demand is challenging due to the complexity of the spatio-temporal networks and the computational intensity of the high-dimensional problems. With the availability of massive traffic data and the emergence of advanced computational methods, this paper develops a data-driven framework that solves the probabilistic dynamic origin-destination demand estimation (PDODE) problem using multi-day data. Different statistical distances (e.g., lp-norm, Wasserstein distance, KL divergence, Bhattacharyya distance) are used and compared to measure the gap between the estimated and the observed traffic conditions, and it is found that 2-Wasserstein distance achieves a balanced accuracy in estimating both mean and standard deviation. The proposed framework is cast into the computational graph and a reparametrization trick is developed to estimate the mean and standard deviation of the probabilistic dynamic OD demand simultaneously. We demonstrate the effectiveness and efficiency of the proposed PDODE framework on both small and real-world networks. In particular, it is demonstrated that the proposed PDODE framework can mitigate the overfitting issues by considering the demand variation. Overall, the developed PDODE framework provides a practical tool for public agencies to understand the sources of demand stochasticity, evaluate day-to-day variation of network flow, and make reliable decisions for intelligent transportation systems.
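
Two ideas named in this abstract, the 2-Wasserstein distance and the reparametrization trick, can be sketched in a few lines. The sketch assumes, purely for illustration, that each quantity is a one-dimensional Gaussian, for which the 2-Wasserstein distance has the closed form sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2). The abstract does not give the paper's actual formulation, and the function names here are hypothetical.

```python
import math
import random


def w2_gaussian_1d(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """2-Wasserstein distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


def reparam_sample(mu: float, sigma: float, rng: random.Random) -> float:
    """Draw a sample as mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample this way keeps the randomness in eps, so the sample
    is a deterministic function of mu and sigma, which is what lets both
    be estimated by gradient-based methods.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical estimated vs. observed demand parameters (illustrative only).
    print(w2_gaussian_1d(100.0, 10.0, 103.0, 14.0))
    print(reparam_sample(100.0, 10.0, rng))
```

Because the closed form penalizes the gap in means and the gap in standard deviations symmetrically, it gives one intuition for why this distance could balance accuracy in both quantities.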

Preprint, 2022, arXiv

GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses

Code embedding is a keystone in the application of machine learning on several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is generic. To this end, we propose the first self-supervised pre-training approach (called GraphCode2Vec) which produces task-agnostic embedding of lexical and program dependence features. GraphCode2Vec achieves this via a synergistic combination of code analysis and Graph Neural Networks. GraphCode2Vec is generic, it allows pre-training, and it is applicable to several SE downstream tasks. We evaluate the effectiveness of GraphCode2Vec on four (4) tasks (method name prediction, solution classification, mutation testing and overfitted patch classification), and compare it with four (4) similarly generic code embedding baselines (Code2Seq, Code2Vec, CodeBERT, GraphCodeBERT) and 7 task-specific, learning-based methods. In particular, GraphCode2Vec is more effective than both generic and task-specific learning-based baselines. It is also complementary and comparable to GraphCodeBERT (a larger and more complex model). We also demonstrate through a probing and ablation study that GraphCode2Vec learns lexical and program dependence features and that self-supervised pre-training improves effectiveness.

Preprint, 2022, arXiv

Power law dependence in a random differential equation

This paper studies a random differential equation with random switch perturbations. We explore how the maximum displacement from the equilibrium state depends on the statistical properties of time series of the random switches. We show a power law dependence between the upper bound of displacement and the frequency of random perturbation switches, and the slope of power law dependence is dependent on the specific distribution of the intervals between switching times. This result suggests a quantitative connection between frequency modulation and amplitude modulation under random perturbations.

Preprint, 2022, arXiv

Substrate-mediated Borophane Polymorphs through Hydrogenation of Two-dimensional Boron Sheets

Two-dimensional boron monolayer (borophene) stands out from the two-dimensional atomic layered materials due to its structural flexibility, tunable electronic and mechanical properties from a large number of allotropic materials. The stability of pristine borophene polymorphs could possibly be improved via hydrogenation with atomic hydrogen (referred to as borophane). However, the precise adsorption structures and the underlying mechanism are still elusive. Employing first-principles calculations, we demonstrate the optimal configurations of freestanding borophanes and the ones grown on metallic substrates. For freestanding β12 and χ3 borophenes, the energetically favored hydrogen adsorption sites are on the top of the boron atoms with CN=4 (CN: coordination number), while the best adsorption sites for α' borophene are on the top of the boron atoms with CN=6. With various metal substrates, the hydrogenation configurations of borophene are modulated significantly, attributed to the chemical hybridization strength between B pz and H s orbitals. These findings provide deep insight into hydrogenating borophenes and facilitate the stabilization of two-dimensional boron polymorphs by engineering hydrogen adsorption sites and concentrations.

Preprint, 2021, arXiv

Short-term origin-destination demand prediction in urban rail transit systems: A channel-wise attentive split-convolutional neural network method

Short-term origin-destination (OD) flow prediction in urban rail transit (URT) plays a crucial role in smart and real-time URT operation and management. Different from other short-term traffic forecasting methods, the short-term OD flow prediction possesses three unique characteristics: (1) data availability: real-time OD flow is not available during the prediction; (2) data dimensionality: the dimension of the OD flow is much higher than the cardinality of transportation networks; (3) data sparsity: URT OD flow is spatiotemporally sparse. There is a great need to develop a novel OD flow forecasting method that explicitly considers the unique characteristics of the URT system. To this end, a channel-wise attentive split-convolutional neural network (CAS-CNN) is proposed. The proposed model consists of many novel components such as the channel-wise attention mechanism and split CNN. In particular, an inflow/outflow-gated mechanism is innovatively introduced to address the data availability issue. We further propose an original masked loss function to solve the data dimensionality and data sparsity issues. The model interpretability is also discussed in detail. The CAS-CNN model is tested on two large-scale real-world datasets from Beijing Subway, and it outperforms the other benchmark methods. The proposed model contributes to the development of short-term OD flow prediction, and it also lays the foundations of real-time URT operation and management.

Preprint, 2021, arXiv

Testing for Treatment Effect in Covariate-Adaptive Randomized Clinical Trials with Generalized Linear Models and Omitted Covariates

Concerns have been expressed over the validity of statistical inference under covariate-adaptive randomization despite its extensive use in clinical trials. In the literature, the inferential properties under covariate-adaptive randomization have been mainly studied for continuous responses; in particular, it is well known that the usual two sample t-test for treatment effect is typically conservative, in the sense that the actual test size is smaller than the nominal level. This phenomenon of invalid tests has also been found for generalized linear models without adjusting for the covariates and is sometimes more worrisome due to inflated Type I error. The purpose of this study is to examine the unadjusted test for treatment effect under generalized linear models and covariate-adaptive randomization. For a large class of covariate-adaptive randomization methods, we obtain the asymptotic distribution of the test statistic under the null hypothesis and derive the conditions under which the test is conservative, valid, or anti-conservative. Several commonly used generalized linear models, such as logistic regression and Poisson regression, are discussed in detail. An adjustment method is also proposed to achieve a valid size based on the asymptotic results. Numerical studies confirm the theoretical findings and demonstrate the effectiveness of the proposed adjustment method.

Preprint, 2020, arXiv

Efficient Estimation of Mixture Cure Frailty Model for Clustered Current Status Data

Current status data abound in the field of epidemiology and public health, where the only observable data for a subject is the random inspection time and the event status at inspection. Motivated by such current status data from a periodontal study where data are inherently clustered, we propose a unified methodology to analyze such complex data. We allow the time-to-event to follow the semiparametric GOR model with a cure fraction, and develop a unified estimation scheme powered by the EM algorithm. The within-subject correlation is accounted for by a random (frailty) effect, and the non-parametric component of the GOR model is approximated via penalized splines, with a set of knot points that increases with the sample size. The proposed methodology is accompanied by a rigorous asymptotic theory, and the related semiparametric efficiency. The finite sample performance of our model parameters is assessed via simulation studies. Furthermore, the proposed methodology is illustrated via application to the oral health data, accompanied by diagnostic checks to identify influential observations. An easy-to-use R package CRFCSD is also available for implementation.

Preprint, 2020, arXiv

Statistical Inference for Covariate-Adaptive Randomization Procedures

Covariate-adaptive randomization (CAR) procedures are frequently used in comparative studies to increase the covariate balance across treatment groups. However, because randomization inevitably uses the covariate information when forming balanced treatment groups, the validity of classical statistical methods after such randomization is often unclear. In this article, we derive the theoretical properties of statistical methods based on general CAR under the linear model framework. More importantly, we explicitly unveil the relationship between covariate-adaptive and inference properties by deriving the asymptotic representations of the corresponding estimators. We apply the proposed general theory to various randomization procedures such as complete randomization, rerandomization, pairwise sequential randomization, and Atkinson's $D_A$-biased coin design and compare their performance analytically. Based on the theoretical results, we then propose a new approach to obtain valid and more powerful tests. These results open a door to understand and analyze experiments based on CAR. Simulation studies provide further evidence of the advantages of the proposed framework and the theoretical results. Supplementary materials for this article are available online.

Preprint, 2019, arXiv

Tannin-controlled micelles and fibrils of $κ$-casein

Effects of green tea tannin epigallocatechin-gallate (EGCG) on thermal-stress-induced amyloid fibril formation of reduced carboxymethylated bovine milk protein $κ$-casein (RCMK) were studied by dynamic light scattering (DLS) and small-angle X-ray scattering (SAXS). Two populations of aggregates, micelles and fibrils, dominated the time evolution of light scattering intensity and of effective hydrodynamic diameter. SAXS experiments allowed us to resolve micelles and fibrils so that the time dependence of the scattering profile revealed the structural evolution of the two populations. The low-Q scattering intensity, prior to an expected increase with time due to fibril growth, shows an intriguing rapid decrease which is interpreted as the release of monomers from micelles. This phenomenon, observed both in the absence and in the presence of EGCG, indicates that under thermal stress free native monomers are converted to amyloid-prone monomers that do not form micelles. The consumption of free native monomers results in a release of native monomers from micelles, because only native protein participates in micelle-monomer (quasi-)equilibrium. This release is reversible, indicating also that native-to-amyloid-prone monomer conversion is reversible as well. We show that EGCG does not bind to protein in fibrils, nor does it affect or prevent the pro-amyloid conversion of monomers. EGCG hinders the addition of monomers to growing fibrils. These facts allowed us to propose a kinetic model for EGCG-controlled amyloid aggregation of micellar proteins. Therein, we introduced the growth-rate inhibition function which quantitatively accounts for the effect of EGCG on the fibril growth at any degree of thermal stress.