Researcher profile

Tao Xiao

Tao Xiao contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified
Verification L1
Unclaimed author
5 works
0 followers
4 topics
4 close collaborators

Published work

5 published items

preprint, 2026, arXiv

Numerical Error Analysis of the Poisson Equation under RHS Inaccuracies in Particle-in-Cell Simulations

Particle-in-Cell (PIC) simulations rely on accurate solutions of the electrostatic Poisson equation, yet accuracy often deteriorates near irregular Dirichlet boundaries on Cartesian meshes. While much research has addressed discretization errors on the left-hand side (LHS) of the Poisson equation, the impact of right-hand-side (RHS) inaccuracies - arising from charge density sampling near boundaries in PIC methods - remains largely unexplored. This study analyzes the numerical errors induced by underestimated RHS values at near-boundary nodes when solving the Poisson equation using embedded boundary finite difference schemes with linear and quadratic treatments. Analytical derivations in one dimension and truncation error analyses in two dimensions reveal that such RHS inaccuracies modify local truncation behavior differently: they reduce the dominant truncation error in the linear scheme but introduce a zeroth-order term in the quadratic scheme, leading to larger global errors. Numerical experiments in one-, two-, and three-dimensional domains confirm these findings. Contrary to expectations, the linear scheme yields superior overall accuracy under typical PIC-induced RHS inaccuracies. A simple RHS calibration strategy is further proposed to restore the accuracy of the quadratic scheme. These results offer new insight into the interplay between boundary-induced RHS errors and discretization accuracy in Poisson-type problems.

preprint, 2025, arXiv

On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study

Large language models (LLMs) have achieved remarkable progress in code generation, largely driven by the availability of high-quality code datasets for effective training. To further improve data quality, numerous training data optimization techniques have been proposed; however, their overall effectiveness has not been systematically evaluated. To bridge this gap, we conduct the first large-scale empirical study, examining five widely-used training data optimization techniques and their pairwise combinations for LLM-based code generation across three benchmarks and four LLMs. Our results show that data synthesis is the most effective technique for improving functional correctness and reducing code smells, although it performs relatively worse on code maintainability compared to data refactoring, cleaning, and selection. Regarding combinations, we find that most combinations do not further improve functional correctness but can effectively enhance code quality (code smells and maintainability). Among all combinations, data synthesis combined with data refactoring achieves the strongest overall performance. Furthermore, our fine-grained analysis reinforces these findings and provides deeper insights into how individual techniques and their combinations influence code generation effectiveness. Overall, this work represents a first step toward a systematic understanding of training data optimization and combination strategies, offering practical guidance for future research and deployment in LLM-based code generation.

preprint, 2024, arXiv

"My GitHub Sponsors profile is live!" Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub's slogan: "Invest in the projects you depend on". However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to sponsors), developers had advertised their GitHub Sponsors profiles on social media, such as Twitter (also known as X). Therefore, in this work, we investigate the impact of tweets that contain links to GitHub Sponsors profiles on sponsorship, as well as their reception on Twitter/X. We further characterize these tweets to understand their context and find that (1) such tweets have the impact of increasing the number of sponsors acquired, (2) compared to other donation platforms such as Open Collective and Patreon, GitHub Sponsors has significantly fewer interactions but is more visible on Twitter/X, and (3) developers tend to contribute more to open-source software during the week of posting such tweets. Our findings are the first step toward investigating the impact of social media on obtaining funding to sustain open-source software.

preprint, 2022, arXiv

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Sponsors, including quantitative and qualitative analyses, to understand the characteristics of developers who are likely to receive donations and what developers think about donations to individuals. We found that: (1) sponsored developers are more active than non-sponsored developers, (2) the possibility to receive donations is related to whether there is someone in their community who is donating, and (3) developers are sponsoring as a new way to contribute to OSS. Our findings are the first step towards data-informed guidance for using GitHub Sponsors, opening up avenues for future work on this new way of financially sustaining the OSS community.

preprint, 2020, arXiv

Tight Revenue Gaps among Simple Mechanisms

We consider a fundamental problem in microeconomics: selling a single item to a number of potential buyers, whose values are drawn from known independent and regular (not necessarily identical) distributions. There are four widely used and widely studied mechanisms in the literature: Myerson Auction (OPT), Sequential Posted-Pricing (SPM), Second-Price Auction with Anonymous Reserve (AR), and Anonymous Pricing (AP). OPT is revenue-optimal but complicated, and it also raises several issues in practice, such as fairness; AP is the simplest mechanism, but also generates the lowest revenue among these four mechanisms; SPM and AR are of intermediate complexity and revenue. We explore revenue gaps among these mechanisms, each of which is defined as the largest ratio between revenues from a pair of mechanisms. We establish two tight bounds and one improved bound:

1. SPM vs. AP: this ratio studies the power of discrimination in pricing schemes. We obtain the tight ratio of $\mathcal{C}^* \approx 2.62$, closing the gap between the previously known bounds $\big[\frac{e}{e - 1}, e\big]$.
2. AR vs. AP: this ratio measures the relative power of auction scheme vs. pricing scheme, when no discrimination is allowed. We attain the tight ratio of $\frac{\pi^2}{6} \approx 1.64$, closing the gap between the previously known bounds $\big[\frac{e}{e - 1}, e\big]$.
3. OPT vs. AR: this ratio quantifies the power of discrimination in auction schemes, and was previously known to lie in $\big[2, e\big]$. The lower bound of $2$ was conjectured to be tight by Hartline and Roughgarden (2009) and Alaei et al. (2015). We obtain a better lower bound of $2.15$, and thus disprove this conjecture.
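
To make the notion of a revenue gap concrete, the following is a minimal sketch on a toy instance that is not taken from the paper: two buyers with independent Uniform[0, 1] values (an assumption chosen because both revenues have simple closed forms). The function names and the grid search are illustrative only. The sketch compares the best Anonymous Pricing (AP) revenue with the best Second-Price Auction with Anonymous Reserve (AR) revenue on this one instance.

```python
import math


def ap_revenue(p: float) -> float:
    """Anonymous Pricing on the toy instance: the item sells at price p
    whenever at least one of the two Uniform[0, 1] values is >= p."""
    return p * (1.0 - p * p)


def ar_revenue(r: float) -> float:
    """Second-price auction with anonymous reserve r on the toy instance.
    Closed form: r * P(exactly one value >= r) + E[min value; both >= r]
    = r^2 - (4/3) r^3 + 1/3."""
    return r * r - 4.0 * r ** 3 / 3.0 + 1.0 / 3.0


def best_on_grid(revenue, steps: int = 10_000) -> float:
    """Best revenue over an evenly spaced grid of prices/reserves in [0, 1]."""
    return max(revenue(i / steps) for i in range(steps + 1))


ap_best = best_on_grid(ap_revenue)   # maximised near p = 1 / sqrt(3)
ar_best = best_on_grid(ar_revenue)   # maximised at r = 1 / 2
ratio = ar_best / ap_best            # AR-vs-AP ratio on this single instance

print(f"AP: {ap_best:.4f}  AR: {ar_best:.4f}  AR/AP: {ratio:.4f}")
print(f"worst-case AR/AP bound from the abstract: {math.pi ** 2 / 6:.4f}")
```

The ratio on this instance is well below $\frac{\pi^2}{6}$, which is expected: the bound in the abstract is the largest ratio over all admissible instances, so any single instance can only give a value at or below it.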