Source author record

Zhen Xu

Zhen Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning; Artificial Intelligence; Computer Vision; Computation and Language; Robotics; Biological Physics; cond-mat.mtrl-sci; cond-mat.soft; cs.CY; Distributed, Parallel, and Cluster Computing; math.PR; Neural and Evolutionary Computing; physics.flu-dyn; physics.med-ph; Social and Information Networks

Catalog footprint

What is connected

19 works

15 topics

4 close collaborators

preprint, 2026, arXiv

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

preprint, 2026, arXiv

Enhancing LLM-Based Data Annotation with Error Decomposition

Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.

preprint, 2026, arXiv

Evaluating 21st-Century Competencies in Postsecondary Curricula with Large Language Models: Performance Benchmarking and Reasoning-Based Prompting Strategies

The growing emphasis on 21st-century competencies in postsecondary education, intensified by the transformative impact of generative AI, underscores the need to evaluate how these competencies are embedded in curricula and how effectively academic programs align with evolving workforce and societal demands. Curricular Analytics, particularly recent generative AI-powered approaches, offer a promising data-driven pathway. However, analyzing 21st-century competencies requires pedagogical reasoning beyond surface-level information retrieval, and the capabilities of large language models in this context remain underexplored. In this study, we extend prior curricular analytics research by examining a broader range of curriculum documents, competency frameworks, and models. Using 7,600 manually annotated curriculum-competency alignment scores, we assess the informativeness of different curriculum sources, benchmark general-purpose LLMs for curriculum-to-competency mapping, and analyze error patterns. We further introduce a reasoning-based prompting strategy, Curricular CoT, to strengthen LLMs' pedagogical reasoning. Our results show that detailed instructional activity descriptions are the most informative type of curriculum document for competency analytics. Open-weight LLMs achieve accuracy comparable to proprietary models on coarse-grained tasks, demonstrating their scalability and cost-effectiveness for institutional use. However, no model reaches human-level precision in fine-grained pedagogical reasoning. Our proposed Curricular CoT yields modest improvements by reducing bias in instructional keyword inference and improving the detection of nuanced pedagogical evidence in long text. Together, these findings highlight the untapped potential of institutional curriculum documents and provide an empirical foundation for advancing AI-driven curricular analytics.

preprint, 2026, arXiv

Temporal Regularization Training: Unleashing the Potential of Spiking Neural Networks

Spiking Neural Networks (SNNs) have received widespread attention due to their event-driven and low-power characteristics, making them particularly effective for processing neuromorphic data. Recent studies have shown that directly trained SNNs suffer from severe temporal gradient vanishing and overfitting issues, which fundamentally constrain their performance and generalizability. This paper unveils a temporal regularization training (TRT) method, designed to unleash the generalization and performance potential of SNNs through a time-decaying regularization mechanism that prioritizes early timesteps with stronger constraints. We perform theoretical analysis to reveal TRT's ability to mitigate temporal gradient vanishing. To validate the effectiveness of TRT, we conduct experiments on both static image datasets and dynamic neuromorphic datasets and analyze the results, demonstrating that TRT can effectively mitigate overfitting and help SNNs converge into flatter local minima with better generalizability. Furthermore, we establish a theoretical interpretation of TRT's temporal regularization mechanism by analyzing the temporal information dynamics inside SNNs. We track the Fisher information of SNNs during the training process, showing that Fisher information progressively concentrates in early timesteps. The time-decaying regularization mechanism implemented in TRT effectively guides the network to learn robust features in early timesteps with rich information, thereby leading to significant improvements in model generalization.

preprint, 2022, arXiv

Bridging the Gap of AutoGraph between Academia and Industry: Analysing AutoGraph Challenge at KDD Cup 2020

Graph structured data is ubiquitous in daily life and scientific areas and has attracted increasing attention. Graph Neural Networks (GNNs) have been proved to be effective in modeling graph structured data and many variants of GNN architectures have been proposed. However, much human effort is often needed to tune the architecture depending on different datasets. Researchers naturally adopt Automated Machine Learning on Graph Learning, aiming to reduce the human effort and achieve generally top-performing GNNs, but their methods focus more on the architecture search. To understand GNN practitioners' automated solutions, we organized AutoGraph Challenge at KDD Cup 2020, emphasizing automated graph neural networks for node classification. We received top solutions, especially from industrial tech companies like Meituan, Alibaba and Twitter, which are already open sourced on GitHub. After detailed comparisons with solutions from academia, we quantify the gaps between academia and industry on modeling scope, effectiveness and efficiency, and show that (1) academic AutoML for Graph solutions focus on GNN architecture search while industrial solutions, especially the winning ones in the KDD Cup, tend to obtain an overall solution; (2) with neural architecture search alone, academic solutions achieve on average 97.3% of the accuracy of industrial solutions; (3) academic solutions are cheap to obtain, taking several GPU hours, while industrial solutions take a few months of labor. Academic solutions also contain far fewer parameters.

preprint, 2022, arXiv

Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform

Obtaining standardized crowdsourced benchmark of computational methods is a major issue in data science communities. Dedicated frameworks enabling fair benchmarking in a unified environment are yet to be developed. Here we introduce Codabench, an open-source, community-driven platform for benchmarking algorithms or software agents versus datasets or tasks. A public instance of Codabench (https://www.codabench.org/) is open to everyone, free of charge, and allows benchmark organizers to compare fairly submissions, under the same setting (software, hardware, data, algorithms), with custom protocols and data formats. Codabench has unique features facilitating the organization of benchmarks flexibly, easily and reproducibly, such as the possibility of re-using templates of benchmarks, and supplying compute resources on-demand. Codabench has been used internally and externally on various applications, receiving more than 130 users and 2500 submissions. As illustrative use cases, we introduce 4 diverse benchmarks covering Graph Machine Learning, Cancer Heterogeneity, Clinical Diagnosis and Reinforcement Learning.

preprint, 2022, arXiv

Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios

Human behavior forecasting during human-human interactions is of utmost importance to provide robotic or virtual agents with social intelligence. This problem is especially challenging for scenarios that are highly driven by interpersonal dynamics. In this work, we present the first systematic comparison of state-of-the-art approaches for behavior forecasting. To do so, we leverage whole-body annotations (face, body, and hands) from the very recently released UDIVA v0.5, which features face-to-face dyadic interactions. Our best attention-based approaches achieve state-of-the-art performance in UDIVA v0.5. We show that by autoregressively predicting the future with methods trained for the short-term future (<400ms), we outperform the baselines even for a considerably longer-term future (up to 2s). We also show that this finding holds when highly noisy annotations are used, which opens new horizons towards the use of weakly-supervised learning. Combined with large-scale datasets, this may help boost the advances in this field.
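The autoregressive idea mentioned in this abstract (reusing a short-horizon predictor for a longer horizon by feeding its outputs back in) can be illustrated with a minimal sketch. The names `constant_velocity_step` and `autoregressive_rollout` are hypothetical, and the toy constant-velocity predictor only stands in for the attention-based models the paper compares; nothing here is taken from the authors' code.

```python
from typing import Callable, List, Sequence

Pose = List[float]  # one frame: a flat list of joint coordinates (assumed layout)


def constant_velocity_step(history: Sequence[Pose]) -> Pose:
    """Hypothetical one-step predictor: extrapolate the last frame-to-frame motion."""
    last, prev = history[-1], history[-2]
    return [2.0 * a - b for a, b in zip(last, prev)]


def autoregressive_rollout(
    step: Callable[[Sequence[Pose]], Pose],
    history: Sequence[Pose],
    horizon: int,
    window: int = 2,
) -> List[Pose]:
    """Predict `horizon` future frames by feeding each prediction back as input."""
    context = list(history[-window:])
    predictions: List[Pose] = []
    for _ in range(horizon):
        next_pose = step(context)
        predictions.append(next_pose)
        # Slide the window: drop the oldest frame, append the new prediction.
        context = context[1:] + [next_pose]
    return predictions


observed = [[0.0, 0.0], [1.0, 0.5]]
future = autoregressive_rollout(constant_velocity_step, observed, horizon=3)
print(future)
```

Any one-step model with the same call signature could replace the toy predictor; the rollout loop itself does not depend on how the step is computed.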

preprint, 2022, arXiv

Confidence Propagation Cluster: Unleash Full Potential of Object Detectors

Most object detection methods have long relied on non-maximum suppression (NMS) and its improved versions, such as Soft-NMS, to remove redundant bounding boxes. We challenge those NMS-based methods from three aspects: 1) The bounding box with the highest confidence value may not be the true positive having the biggest overlap with the ground-truth box. 2) Not only is suppression required for redundant boxes, but confidence enhancement is also needed for true positives. 3) Sorting candidate boxes by confidence values is not necessary, so full parallelism is achievable. In this paper, inspired by belief propagation (BP), we propose the Confidence Propagation Cluster (CP-Cluster) to replace NMS-based methods; it is fully parallelizable as well as more accurate. In CP-Cluster, we borrow the message passing mechanism from BP to penalize redundant boxes and enhance true positives simultaneously in an iterative way until convergence. We verified the effectiveness of CP-Cluster by applying it to various mainstream detectors such as FasterRCNN, SSD, FCOS, YOLOv3, YOLOv5, and CenterNet. Experiments on MS COCO show that our plug-and-play method, without retraining detectors, steadily improves the average mAP of all those state-of-the-art models by a clear margin of 0.3 to 1.9 compared with NMS-based methods.
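For context, the baseline this abstract argues against can be sketched as follows. This is plain greedy NMS, not CP-Cluster: it sorts boxes by confidence and suppresses overlapping lower-scored boxes, which is the sequential step the abstract says is unnecessary. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(
    boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # The confidence sort makes the procedure inherently sequential.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

Note that the surviving box keeps its original score; NMS only removes candidates, which is the second limitation the abstract lists.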

preprint, 2022, arXiv

Didn't see that coming: a survey on non-verbal social human behavior forecasting

Non-verbal social human behavior forecasting has increasingly attracted the interest of the research community in recent years. Its direct applications to human-robot interaction and socially-aware human motion generation make it a very attractive field. In this survey, we define the behavior forecasting problem for multiple interactive agents in a generic way that aims at unifying the fields of social signals prediction and human motion forecasting, traditionally separated. We hold that both problem formulations refer to the same conceptual problem, and identify many shared fundamental challenges: future stochasticity, context awareness, history exploitation, etc. We also propose a taxonomy that comprises methods published in the last 5 years in a very informative way and describes the current main concerns of the community with regard to this problem. In order to promote further research on this field, we also provide a summarised and friendly overview of audiovisual datasets featuring non-acted social interactions. Finally, we describe the most common metrics used in this task and their particular issues.

preprint, 2022, arXiv

Winning solutions and post-challenge analyses of the ChaLearn AutoDL challenge 2019

This paper reports the results and post-challenge analyses of ChaLearn's AutoDL challenge series, which helped sorting out a profusion of AutoML solutions for Deep Learning (DL) that had been introduced in a variety of settings, but lacked fair comparisons. All input data modalities (time series, images, videos, text, tabular) were formatted as tensors and all tasks were multi-label classification problems. Code submissions were executed on hidden tasks, with limited time and computational resources, pushing solutions that get results quickly. In this setting, DL methods dominated, though popular Neural Architecture Search (NAS) was impractical. Solutions relied on fine-tuned pre-trained networks, with architectures matching data modality. Post-challenge tests did not reveal improvements beyond the imposed time limit. While no component is particularly original or novel, a high level modular organization emerged featuring a "meta-learner", "data ingestor", "model selector", "model/learner", and "evaluator". This modularity enabled ablation studies, which revealed the importance of (off-platform) meta-learning, ensembling, and efficient data management. Experiments on heterogeneous module combinations further confirm the (local) optimality of the winning solutions. Our challenge legacy includes an ever-lasting benchmark (http://autodl.chalearn.org), the open-sourced code of the winners, and a free "AutoDL self-service".

preprint, 2021, arXiv

Growth and Collapse of an Isolated Bubble Driven by a Single Negative Histotripsy Cycle in Agarose Gel: Stress, Strain, and Strain Rate Fields

Histotripsy relies on cavitation to mechanically homogenize soft tissue. There is strong evidence that the high stresses, strains, and strain rates developed as bubbles grow and collapse contribute to this tissue homogenization. While such stresses and strains have been examined computationally in model systems with assumed constitutive models (e.g., finite-deformation Neo-Hookean model) and viscoelastic properties determined under quasi-static conditions, recent studies proposed that the Quadratic Law Kelvin-Voigt (QLKV) constitutive model, which additionally accounts for strain stiffening, more accurately represents the viscoelastic response of soft materials subjected to cavitation; this model has also been used to infer viscoelastic properties at high rates. In this work, we use the QLKV model and these properties to calculate the time-dependent stress, strain, and strain rate fields produced during the growth and collapse of individual bubbles subjected to a histotripsy-relevant pressure waveform in agarose gels of 0.3% and 1.0% concentration and corresponding to actual (past) experiments. We find that, as the gel concentration is increased, strain stiffening manifests in larger elastic stresses and compressive stresses extending into the collapse phase, particularly for the 1.0% concentration gel. As a result, the duration of the collapse phase also increases. In comparison with the conventional Neo-Hookean model, the compressive stress has a larger magnitude, extends farther into the surrounding medium, and shows an increased departure from growth/collapse symmetry close to the bubble; all of these effects are magnified in the stiffer gel.

preprint, 2021, arXiv

NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning

Offline reinforcement learning (RL) aims at learning a good policy from a batch of collected data, without extra interactions with the environment during training. However, current offline RL benchmarks commonly have a large reality gap, because they involve large datasets collected by highly exploratory policies, and the trained policy is directly evaluated in the environment. In real-world situations, running a highly exploratory policy is prohibited to ensure system safety, the data is commonly very limited, and a trained policy should be well validated before deployment. In this paper, we present a near real-world offline RL benchmark, named NeoRL, which contains datasets from various domains with controlled sizes, and extra test datasets for policy validation. We evaluate existing offline RL algorithms on NeoRL and argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward. The empirical results demonstrate that the tested offline RL algorithms become less competitive with the deterministic policy on many datasets, and the offline policy evaluation hardly helps. The NeoRL suite can be found at http://polixir.ai/research/neorl. We hope this work will shed some light on future research and draw more attention when deploying RL in real-world systems.

preprint, 2021, arXiv

Physics-based Constitutive Modeling of Photo-oxidative Aging in Semi-Crystalline Polymers based on Chemical Characterization Techniques

This paper proposes a physio-chemically-based constitutive framework to simulate and predict the response of semi-crystalline low-density polyethylene (LDPE) to severe photo-oxidation. Photo-oxidation induced by exposure to Ultra-Violet (UV) light and oxygen is the dominant degradation mechanism affecting the lifespan of LDPE. In this work, we propose evolution functions for the material properties in the constitutive equations of Boyce et al. (2000) to incorporate the effects of photo-oxidation on the mechanical response of LDPE. The evolution functions are based on chemically verified processes that are responsible for material degradation, namely the change in crystallinity and mass loss relative to the initial pristine films over exposure time. Changes in crystallinity and mass loss are characterized by Differential Scanning Calorimetry (DSC) and Quartz Crystal Microbalance with Dissipation Monitoring (QCM-D) experiments, respectively. Connecting the physio-chemical processes affecting polymer network evolution to the mechanical response of LDPE bypasses the need for defining fitting parameters that carry no physical meaning. The developed constitutive framework is validated with respect to a series of in-house uniaxial tensile tests performed on LDPE aged for different UV exposure times. Comparison of the constitutive framework versus experimental mechanical tests also confirms the accuracy of DSC and QCM-D as rigorous techniques to monitor and characterize degradation in LDPE films. The outcomes shed light on the evolution of the macromolecular network in LDPE under extreme photo-oxidation and the evolution of the associated mechanical material properties.

preprint, 2020, arXiv

Acoustic measurements of the nucleus size distribution at the cavitation threshold

Understanding the acoustic cavitation threshold is essential for minimizing cavitation bioeffects in diagnostic ultrasound and for controlling cavitation-mediated tissue ablation in focused ultrasound procedures. The homogeneous cavitation threshold is an intrinsic material property of recognized importance to a variety of applications requiring cavitation control. However, acoustic measurements of the cavitation threshold in water differ from those predicted by classical nucleation theories. This persistent discrepancy is explained by combining novel methods for acoustically nucleating single bubbles at threshold with numerical modeling to obtain a nucleus size distribution consistent with first-principles estimates for ion-stabilized nuclei. We identify acoustic cavitation at threshold as a reproducible subtype of heterogeneous cavitation with a characteristic nucleus size distribution. Knowledge of the nucleus size distribution could inspire new approaches for achieving cavitation control in water, tissue, and a variety of other media.

preprint, 2020, arXiv

Flow Contrastive Estimation of Energy-Based Models

This paper studies a training method to jointly estimate an energy-based model and a flow-based model, in which the two models are iteratively updated based on a shared adversarial value function. This joint training method has the following traits. (1) The update of the energy-based model is based on noise contrastive estimation, with the flow model serving as a strong noise distribution. (2) The update of the flow model approximately minimizes the Jensen-Shannon divergence between the flow model and the data distribution. (3) Unlike generative adversarial networks (GANs), which estimate an implicit probability distribution defined by a generator model, our method estimates two explicit probabilistic distributions on the data. Using the proposed method, we demonstrate a significant improvement in the synthesis quality of the flow model, and show the effectiveness of unsupervised feature learning by the learned energy-based model. Furthermore, the proposed training method can be easily adapted to semi-supervised learning. We achieve results competitive with state-of-the-art semi-supervised learning methods.
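The abstract does not spell out the shared value function, so the following is only a sketch of the standard noise contrastive estimation objective with equal numbers of data and noise samples, where `p` plays the role of the energy-based model and `q` the role of the flow used as the noise distribution. The function name `nce_value` and its arguments (per-sample log-densities) are assumptions made for illustration.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_value(
    log_p_data: Sequence[float],
    log_q_data: Sequence[float],
    log_p_noise: Sequence[float],
    log_q_noise: Sequence[float],
) -> float:
    """Standard NCE value: E_data[log p/(p+q)] + E_noise[log q/(p+q)].

    Uses the identity log(p / (p + q)) = log_sigmoid(log p - log q).
    """
    data_term = sum(
        log_sigmoid(lp - lq) for lp, lq in zip(log_p_data, log_q_data)
    ) / len(log_p_data)
    noise_term = sum(
        log_sigmoid(lq - lp) for lp, lq in zip(log_p_noise, log_q_noise)
    ) / len(log_p_noise)
    return data_term + noise_term


# When both models assign identical densities, the classifier is at chance.
value = nce_value([-1.0, -2.0], [-1.0, -2.0], [-0.5], [-0.5])
print(value)
```

In an adversarial reading of this objective, one model would be updated to increase the value and the other to decrease it; at chance level the value equals -2 log 2.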

preprint, 2020, arXiv

Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer

Effective modeling of electronic health records (EHR) is rapidly becoming an important topic in both academia and industry. A recent study showed that using the graphical structure underlying EHR data (e.g. relationship between diagnoses and treatments) improves the performance of prediction tasks such as heart failure prediction. However, EHR data do not always contain complete structure information. Moreover, when it comes to claims data, structure information is completely unavailable to begin with. Under such circumstances, can we still do better than just treating EHR data as a flat-structured bag-of-features? In this paper, we study the possibility of jointly learning the hidden structure of EHR while performing supervised prediction tasks on EHR data. Specifically, we discuss that Transformer is a suitable basis model to learn the hidden EHR structure, and propose Graph Convolutional Transformer, which uses data statistics to guide the structure learning process. The proposed model consistently outperformed previous approaches empirically, on both synthetic data and publicly available EHR data, for various prediction tasks such as graph reconstruction and readmission prediction, indicating that it can serve as an effective general-purpose representation learning algorithm for EHR data.

preprint, 2020, arXiv

Online Learning with Cumulative Oversampling: Application to Budgeted Influence Maximization

We propose a cumulative oversampling (CO) method for online learning. Our key idea is to sample parameter estimations from the updated belief space once in each round (similar to Thompson Sampling), and utilize the cumulative samples up to the current round to construct optimistic parameter estimations that asymptotically concentrate around the true parameters as tighter upper confidence bounds compared to the ones constructed with standard UCB methods. We apply CO to a novel budgeted variant of the Influence Maximization (IM) semi-bandits with linear generalization of edge weights, whose offline problem is NP-hard. Combining CO with the oracle we design for the offline problem, our online learning algorithm simultaneously tackles budget allocation, parameter learning, and reward maximization. We show that for IM semi-bandits, our CO-based algorithm achieves a scaled regret comparable to that of the UCB-based algorithms in theory, and performs on par with Thompson Sampling in numerical experiments.

preprint, 2016, arXiv

Instantaneous Control of Brownian Motion with a Positive Lead Time

Consider a storage system where the content is driven by a Brownian motion absent control. At any time, one may increase or decrease the content at a cost proportional to the amount of adjustment. A decrease of the content takes effect immediately, while an increase is realized after a fixed lead time $\ell$. Holding costs are incurred continuously over time and are a convex function of the content. The objective is to find a control policy that minimizes the expected present value of the total costs. Due to the positive lead time for upward adjustments, one needs to keep track of all the outstanding upward adjustments as well as the actual content at time $t$ as there may also be downward adjustments during $[t,t+\ell)$, i.e., the state of the system is a function on $[0,\ell]$. To the best of our knowledge, this is the first paper to study instantaneous control of stochastic systems in such a functional setting. We first extend the concept of $L^\natural$-convexity to function spaces and establish the $L^\natural$-convexity of the optimal cost function. We then derive various properties of the cost function and identify the structure of the optimal policy as a state-dependent two-sided reflection mapping making the minimum amount of adjustment necessary to keep the system states within a certain region.

preprint, 2016, arXiv

Using Social Dynamics to Make Individual Predictions: Variational Inference with a Stochastic Kinetic Model

Social dynamics is concerned primarily with interactions among individuals and the resulting group behaviors, modeling the temporal evolution of social systems via the interactions of individuals within these systems. In particular, the availability of large-scale data from social networks and sensor networks offers an unprecedented opportunity to predict state-changing events at the individual level. Examples of such events include disease transmission, opinion transition in elections, and rumor propagation. Unlike previous research focusing on the collective effects of social systems, this study makes efficient inferences at the individual level. In order to cope with dynamic interactions among a large number of individuals, we introduce the stochastic kinetic model to capture adaptive transition probabilities and propose an efficient variational inference algorithm whose complexity grows linearly, rather than exponentially, with the number of individuals. To validate this method, we have performed epidemic-dynamics experiments on wireless sensor network data collected from more than ten thousand people over three years. The proposed algorithm was used to track disease transmission and predict the probability of infection for each individual. Our results demonstrate that this method is more efficient than sampling while nonetheless achieving high accuracy.
