Source author record

Xinyi Xu

Xinyi Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Methodology; cond-mat.mtrl-sci; physics.optics; Artificial Intelligence; Computation and Language; Computer Science and Game Theory; cond-mat.mes-hall; Distributed, Parallel, and Cluster Computing; Information Theory; math.IT; math.ST; physics.atom-ph; Software Engineering; Statistics Theory

Catalog footprint

What is connected

13 works

15 topics

4 close collaborators

preprint · 2026 · arXiv

From Failure to Mastery: Generating Hard Samples for Tool-use Agents

The advancement of LLM agents with tool-use capabilities requires diverse and complex training corpora. Existing data generation methods, which predominantly follow a paradigm of random sampling and shallow generation, often yield simple and homogeneous trajectories that fail to capture complex, implicit logical dependencies. To bridge this gap, we introduce HardGen, an automatic agentic pipeline designed to generate hard tool-use training samples with verifiable reasoning. Firstly, HardGen establishes a dynamic API Graph built upon agent failure cases, from which it samples to synthesize hard traces. Secondly, these traces serve as conditional priors to guide the instantiation of modular, abstract advanced tools, which are subsequently leveraged to formulate hard queries. Finally, the advanced tools and hard queries enable the generation of verifiable complex Chain-of-Thought (CoT), with a closed-loop evaluation feedback steering the continuous refinement of the process. Extensive evaluations demonstrate that a 4B parameter model trained with our curated dataset achieves superior performance compared to several leading open-source and closed-source competitors (e.g., GPT-5.2, Gemini-3-Pro and Claude-Opus-4.5). Our code, models, and dataset will be open-sourced to facilitate future research.

preprint · 2026 · arXiv

Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning

Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods neither verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.

preprint · 2022 · arXiv

On the Convergence of the Shapley Value in Parametric Bayesian Learning Games

Measuring contributions is a classical problem in cooperative game theory where the Shapley value is the most well-known solution concept. In this paper, we establish the convergence property of the Shapley value in parametric Bayesian learning games where players perform a Bayesian inference using their combined data, and the posterior-prior KL divergence is used as the characteristic function. We show that for any two players, under some regularity conditions, their difference in Shapley value converges in probability to the difference in Shapley value of a limiting game whose characteristic function is proportional to the log-determinant of the joint Fisher information. As an application, we present an online collaborative learning framework that is asymptotically Shapley-fair. Our result enables this to be achieved without any costly computations of posterior-prior KL divergences. Only a consistent estimator of the Fisher information is needed. The effectiveness of our framework is demonstrated with experiments using real-world data.
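As a rough illustration of the quantities named in this abstract, the sketch below computes exact Shapley values for a three-player game whose characteristic function is the log-determinant of the summed per-player Fisher information. The player names, the 2x2 matrices in `FISHER`, and the convention that the empty coalition has value 0 are assumptions made for this example, not details from the paper. The empty-coalition value cancels out of any difference between two players' Shapley values, so that convention does not affect the comparison the abstract describes.

```python
from itertools import permutations
from math import log

# Hypothetical per-player Fisher information matrices (2x2, positive definite).
FISHER = {
    "A": ((2.0, 0.0), (0.0, 1.0)),
    "B": ((1.0, 0.0), (0.0, 1.0)),
    "C": ((1.0, 0.0), (0.0, 1.0)),
}


def log_det_value(coalition):
    """Characteristic function: log det of the coalition's summed Fisher information.

    The empty coalition is assigned 0.0 by convention (an assumption of this sketch).
    """
    if not coalition:
        return 0.0
    total = [[0.0, 0.0], [0.0, 0.0]]
    for player in coalition:
        for r in range(2):
            for c in range(2):
                total[r][c] += FISHER[player][r][c]
    det = total[0][0] * total[1][1] - total[0][1] * total[1][0]
    return log(det)


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    return {p: t / len(orderings) for p, t in totals.items()}


phi = shapley_values(["A", "B", "C"], log_det_value)
print(phi)
```

Enumerating all orderings costs n! evaluations, so this only suits very small games. Players B and C hold identical matrices here and therefore receive equal values, while A, with more information, receives more.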

preprint · 2021 · arXiv

Industry Practice of Coverage-Guided Enterprise-Level DBMS Fuzzing

As an infrastructure for data persistence and analysis, Database Management Systems (DBMSs) are the cornerstones of modern enterprise software. To improve their correctness, the industry has been applying black-box fuzzing for decades. Recently, the research community achieved impressive fuzzing gains using coverage guidance. However, due to the complexity and distributed nature of enterprise-level DBMSs, these techniques are seldom applied in industry. In this paper, we apply coverage-guided fuzzing to enterprise-level DBMSs from Huawei and Bloomberg LP. In our practice of testing GaussDB and Comdb2, we found major challenges in all three testing stages. The challenges are collecting precise coverage, optimizing fuzzing performance, and analyzing root causes. In search of a general method to overcome these challenges, we propose Ratel, a coverage-guided fuzzer for enterprise-level DBMSs. With its industry-oriented design, Ratel improves the feedback precision, enhances the robustness of input generation, and performs an online investigation of the root cause of bugs. As a result, Ratel outperformed other fuzzers in terms of coverage and bugs. Compared to industrial black-box fuzzers SQLsmith and SQLancer, as well as coverage-guided academic fuzzer Squirrel, Ratel covered 38.38%, 106.14%, 583.05% more basic blocks than the best results of the other three fuzzers in GaussDB, PostgreSQL, and Comdb2, respectively. More importantly, Ratel has discovered 32, 42, and 5 unknown bugs in GaussDB, Comdb2, and PostgreSQL.
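For readers unfamiliar with coverage guidance, the following is a minimal sketch of the general idea: a mutated input is kept only if it reaches code that no earlier input reached. It is a toy loop, not Ratel's design. The target function `toy_target`, its branch names, and the single-byte mutation strategy are all hypothetical and chosen only to keep the example short.

```python
import random


def toy_target(data: bytes) -> set:
    """Hypothetical program under test: returns the set of branch ids it executed."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("S"):
        covered.add("b1")
        if len(data) > 1 and data[1] == ord("Q"):
            covered.add("b2")
            if len(data) > 2 and data[2] == ord("L"):
                covered.add("b3")
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf))
    buf[pos] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed_input: bytes, iterations: int, rng: random.Random):
    """Coverage-guided loop: keep a mutant only if it covers a new branch."""
    corpus = [seed_input]
    global_cov = set(toy_target(seed_input))
    for _ in range(iterations):
        parent = rng.choice(corpus)
        child = mutate(parent, rng)
        cov = toy_target(child)
        if not cov <= global_cov:  # the child reached at least one new branch
            corpus.append(child)
            global_cov |= cov
    return corpus, global_cov


corpus, coverage = fuzz(b"AAA", 2000, random.Random(0))
print(len(corpus), sorted(coverage))
```

A black-box fuzzer would generate inputs without the `cov <= global_cov` check. The feedback step is what lets the corpus accumulate inputs that pass the nested conditions one level at a time.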

preprint · 2020 · arXiv

Collaborative Fairness in Federated Learning

In current deep learning paradigms, local training or the Standalone framework tends to result in overfitting and thus poor generalizability. This problem can be addressed by Distributed or Federated Learning (FL) that leverages a parameter server to aggregate model updates from individual participants. However, most existing Distributed or FL frameworks have overlooked an important aspect of participation: collaborative fairness. In particular, all participants can receive the same or similar models, regardless of their contributions. To address this issue, we investigate collaborative fairness in FL and propose a novel Collaborative Fair Federated Learning (CFFL) framework, which uses reputation to make participants converge to different models, thus achieving fairness without compromising the predictive performance. Extensive experiments on benchmark datasets demonstrate that CFFL achieves high fairness, delivers comparable accuracy to the Distributed framework, and outperforms the Standalone framework.

preprint · 2020 · arXiv

Spin Wave Generation via Localized Spin-Orbit Torque in an Antiferromagnet-Topological Insulator Heterostructure

The spin-orbit torque induced by a topological insulator (TI) is theoretically examined for spin wave generation in a neighboring antiferromagnetic thin film. The investigation is based on the micromagnetic simulation of Néel vector dynamics and the analysis of transport properties in the TI. The results clearly illustrate that propagating spin waves can be achieved in the antiferromagnetic thin-film strip through localized excitation, traveling over a long distance. The oscillation amplitude gradually decays due to the non-zero damping as the Néel vector precesses around the magnetic easy axis with a fixed frequency. The frequency is also found to be tunable via the strength of the driving electrical current density. While both the bulk and the surface states of the TI contribute to induce the effective torque, the calculation indicates that the surface current plays a dominant role over the bulk counterpart except in the heavily degenerate cases. Compared to the more commonly applied heavy metals, the use of a TI can substantially reduce the threshold current density to overcome the magnetic anisotropy, making it an efficient choice for spin wave generation. The Néel vector dynamics in the nano-oscillator geometry are examined as well.

preprint · 2019 · arXiv

Broadband optical parametric amplification by two-dimensional semiconductors

Optical parametric amplification is a second-order nonlinear process whereby an optical signal is amplified by a pump via the generation of an idler field. It is the key ingredient of tunable sources of radiation that play an important role in several photonic applications. This mechanism is inherently related to spontaneous parametric down-conversion that currently constitutes the building block for entangled photon pair generation, which has been exploited in modern quantum technologies ranging from computing to communications and cryptography. Here we demonstrate single-pass optical parametric amplification at the ultimate thickness limit; using semiconducting transition-metal dichalcogenides, we show that amplification can be attained over a propagation through a single atomic layer. Such a second-order nonlinear interaction at the 2D limit bypasses phase-matching requirements and achieves ultrabroad amplification bandwidths. The amplification process is independent of the in-plane polarization of the impinging signal and pump fields. First-principles calculations confirm the observed polarization invariance and linear relationship between idler and pump powers. Our results pave the way for the development of atom-sized tunable sources of radiation with applications in nanophotonics and quantum information technology.

preprint · 2016 · arXiv

A Correlation Analysis Method for Power Systems Based on Random Matrix Theory

The operating status of power systems is influenced by a growing variety of factors, resulting from the increasing size and complexity of power systems; in this situation, model-based methods need to be revisited. This paper instead proposes a data-driven method as a novel alternative: it reveals the correlations between the factors and the system status through statistical properties of data. An augmented matrix, as the data source, is the key to this method; it consists of two parts: 1) status data as the basic part, and 2) factor data as the augmented part. Random matrix theory (RMT) is applied as the mathematical framework. Linear eigenvalue statistics (LESs), such as the mean spectral radius (MSR), are defined to study data correlations through large random matrices. Compared with model-based methods, the proposed method takes a purely statistical approach, requiring no prior knowledge of operation and interaction mechanism models for power systems and factors. In general, this method is direct in analysis, robust against bad data, universal to various factors, and applicable for real-time analysis. A case study, based on the standard IEEE 118-bus system, validates the proposed method.

preprint · 2015 · arXiv

3D Power-map for Smart Grids---An Integration of High-dimensional Analysis and Visualization

Data with features of volume, velocity, variety, and veracity challenge traditional tools for extracting useful analysis for decision-making. By integrating high-dimensional analysis with visualization, this paper develops a 3D power-map animation as an effective solution to the challenge. An architecture design, with a detailed data processing procedure, is proposed to realize the integration. Two of the most important components in the architecture are presented: the Single-Ring Law for random matrices as a solid mathematical foundation, and the proposed statistical index MSR as high-dimensional data for visualization. The whole procedure is simple in logic, fast, objective, and robust against bad data. Moreover, it is an unsupervised machine learning mechanism directly oriented to the raw data rather than logics or models based on simplifications and assumptions. A case study validates the effectiveness and performance of the developed 3D power-map in analysis extraction.

preprint · 2015 · arXiv

Is the Low-Complexity Mobile-Relay-Aided FFR-DAS Capable of Outperforming the High-Complexity CoMP?

Coordinated multi-point transmission/reception aided collocated antenna system (CoMP-CAS) and mobile relay assisted fractional frequency reuse distributed antenna system (MR-FFR-DAS) constitute a pair of virtual-MIMO based technical options for achieving high spectral efficiency in interference-limited cellular networks. In practice both techniques have their respective pros and cons, which are studied in this paper by evaluating the achievable cell-edge performance on the uplink of multicell systems. We show that assuming the same antenna configuration in both networks, the maximum available cooperative spatial diversity inherent in the MR-FFR-DAS is lower than that of the CoMP-CAS. However, when the cell-edge MSs have a low transmission power, the lower-complexity MR-FFR-DAS relying on the simple single-cell processing may outperform the CoMP-CAS by using the proposed soft-combining based probabilistic data association (SC-PDA) receiver, despite the fact that the latter scheme is more complex and incurs a higher cooperation overhead. Furthermore, the benefits of the SC-PDA receiver may be enhanced by properly selecting the MRs' positions. Additionally, we show that the performance of the cell-edge MSs roaming near the angular direction halfway between two adjacent RAs (i.e. the "worst-case direction") of the MR-FFR-DAS may be more significantly improved than that of the cell-edge MSs of other directions by using multiuser power control, which also improves the fairness amongst cell-edge MSs. Our simulation results show that given a moderate MS transmit power, the proposed MR-FFR-DAS architecture employing the SC-PDA receiver is capable of achieving significantly better bit-error rate (BER) and effective throughput across the entire cell-edge area, including even the "worst-case direction" and the cell-edge boundary, than the CoMP-CAS architecture.

preprint · 2015 · arXiv

Multi-photon Absorption in Optical Pumping of Rubidium

In optical pumping of rubidium, a new kind of absorption occurs at higher amplitudes of the radio-frequency current. From measurements of the magnetic field value at which this absorption occurs, we conclude that it is multi-photon absorption. Both the degeneracy and the energy of the photons contribute to the intensity.

preprint · 2012 · arXiv

From Minimax Shrinkage Estimation to Minimax Shrinkage Prediction

In a remarkable series of papers beginning in 1956, Charles Stein set the stage for the future development of minimax shrinkage estimators of a multivariate normal mean under quadratic loss. More recently, parallel developments have seen the emergence of minimax shrinkage estimators of multivariate normal predictive densities under Kullback-Leibler risk. We here describe these parallels, emphasizing the focus on Bayes procedures and the derivation of the superharmonic conditions for minimaxity, as well as further developments of new minimax shrinkage predictive density estimators including multiple shrinkage estimators, empirical Bayes estimators, normal linear model regression estimators and nonparametric regression estimators.
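The abstract does not write out a specific estimator. As a concrete point of reference, the classical James-Stein estimator is the best-known shrinkage estimator of a multivariate normal mean under quadratic loss: for an observation x in dimension p >= 3 with known variance sigma^2, it scales x by 1 - (p - 2) * sigma^2 / ||x||^2. The sketch below is a minimal implementation of that textbook formula; the function name `james_stein` and the sample vector are assumptions made for illustration.

```python
def james_stein(x, sigma2=1.0):
    """Classical James-Stein estimate of a normal mean, shrinking x toward the origin.

    Assumes x ~ N(theta, sigma2 * I) with known sigma2 and dimension p >= 3.
    """
    p = len(x)
    if p < 3:
        raise ValueError("James-Stein shrinkage requires dimension p >= 3")
    norm2 = sum(v * v for v in x)
    if norm2 == 0.0:
        raise ValueError("shrinkage factor is undefined at x = 0")
    shrink = 1.0 - (p - 2) * sigma2 / norm2
    return [shrink * v for v in x]


# Hypothetical observation with ||x||^2 = 25 and p = 5,
# so the shrinkage factor is 1 - 3/25 = 0.88.
estimate = james_stein([3.0, 4.0, 0.0, 0.0, 0.0])
print(estimate)
```

This plain form can overshoot and flip signs when ||x||^2 is small. The positive-part variant, which clips the factor at zero, avoids that but is not shown here.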

preprint · 2010 · arXiv

Asymptotic minimax risk of predictive density estimation for non-parametric regression

We consider the problem of estimating the predictive density of future observations from a non-parametric regression model. The density estimators are evaluated under Kullback-Leibler divergence and our focus is on establishing the exact asymptotics of minimax risk in the case of Gaussian errors. We derive the convergence rate and constant for minimax risk among Bayesian predictive densities under Gaussian priors and we show that this minimax risk is asymptotically equivalent to that among all density estimators.
