Researcher profile

Xiao Tian

Xiao Tian contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning

Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods do not verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.

preprint2026arXiv

INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy

Differential privacy (DP) is widely employed in machine learning to protect confidential or sensitive training data from being revealed. As data owners gain greater control over their data due to personal data ownership, they are more likely to set their own privacy requirements, necessitating individualized DP (IDP) to fulfil such requests. In particular, owners of data from more sensitive subsets, such as positive cases of stigmatized diseases, likely set stronger privacy requirements, as leakage of such data could incur more serious societal impact. However, existing IDP algorithms induce a critical utility imbalance problem: Data from owners with stronger privacy requirements may be severely underrepresented in the trained model, resulting in poorer performance on similar data from subsequent users during deployment. In this paper, we analyze this problem and propose the INO-SGD algorithm, which strategically down-weights data within each batch to improve performance on the more private data across all iterations. Notably, our algorithm is specially designed to satisfy IDP, while existing techniques addressing utility imbalance neither satisfy IDP nor can be easily adapted to do so. Lastly, we demonstrate the empirical feasibility of our approach.

preprint2026arXiv

Is Data Shapley Not Better than Random in Data Selection? Ask NASH

Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.

preprint2021arXiv

Constraining the circumburst medium of gamma-ray bursts with X-ray afterglows

Long gamma-ray bursts (GRBs) are considered to be originated from core collapse of massive stars. It is believed that the afterglow property is determined by the density of the material in the surrounding interstellar medium. Therefore, the circumburst density can be used to distinguish between an interstellar wind $n(R) \propto R^{\rm -k}$, and a constant density medium (ISM), $n(R) = const$. Previous studies with different afterglow samples, show that the circumburst medium of GRBs is neither simply supported by an interstellar wind, nor completely favored by an ISM. In this work, our new sample is consisted of 39 GRBs with smoothly onset bump-like features in early X-ray afterglows, in which 20 GRBs have the redshift measurements. By using a smooth broken power law function to fit the bumps of X-ray light curves, we derive the full width at half-maximum (FWHM) as the feature width ($ω$), as well as the rise and decay time scales of the bumps ($T_{\rm r}$ and $T_{\rm d}$). The correlations between the timescales of X-ray bumps are similar to those found previously in the optical afterglows. Based on the fireball forward shock (FS) model of the thin shell case, we obtain the distribution of the electron spectral index $p$, and further constrain the medium density distribution index $k$. The new inferred $k$ is found to be concentrated at 1.0, with a range from 0.2 to 1.8. This finding is consistent with previous studies. The conclusion of our detailed investigation for X-ray afterglows suggests that the ambient medium of the selected GRBs is not homogeneous, i.e., neither ISM nor the typical interstellar wind.

preprint2020arXiv

Real-time Earthquake Early Warning with Deep Learning: Application to the 2016 Central Apennines, Italy Earthquake Sequence

Earthquake early warning systems are required to report earthquake locations and magnitudes as quickly as possible before the damaging S wave arrival to mitigate seismic hazards. Deep learning techniques provide potential for extracting earthquake source information from full seismic waveforms instead of seismic phase picks. We developed a novel deep learning earthquake early warning system that utilizes fully convolutional networks to simultaneously detect earthquakes and estimate their source parameters from continuous seismic waveform streams. The system determines earthquake location and magnitude as soon as one station receives earthquake signals and evolutionarily improves the solutions by receiving continuous data. We apply the system to the 2016 Mw 6.0 earthquake in Central Apennines, Italy and its subsequent sequence. Earthquake locations and magnitudes can be reliably determined as early as four seconds after the earliest P phase, with mean error ranges of 6.8-3.7 km and 0.31-0.23, respectively.