Researcher profile

Ziteng Sun

Ziteng Sun contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
7 works
0 followers
9 topics
4 close collaborators

Published work

7 published items

preprint | 2026 | arXiv

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.

preprint | 2022 | arXiv

Correlated quantization for distributed mean estimation and optimization

We study the problem of distributed mean estimation and optimization under communication constraints. We propose a correlated quantization protocol whose leading term in the error guarantee depends on the mean deviation of data points rather than only their absolute range. The design does not need any prior knowledge of the concentration property of the dataset, which is required to get such dependence in previous works. We show that applying the proposed protocol as a subroutine in distributed optimization algorithms leads to better convergence rates. We also prove the optimality of our protocol under mild assumptions. Experimental results show that our proposed algorithm outperforms existing mean estimation protocols on a diverse set of tasks.

preprint | 2022 | arXiv

The Role of Interactivity in Structured Estimation

We study high-dimensional sparse estimation under three natural constraints: communication constraints, local privacy constraints, and linear measurements (compressive sensing). Without sparsity assumptions, it has been established that interactivity cannot improve the minimax rates of estimation under these information constraints. The question of whether interactivity helps with natural inference tasks has been a topic of active research. We settle this question in the affirmative for the prototypical problems of high-dimensional sparse mean estimation and compressive sensing, by demonstrating a gap between interactive and noninteractive protocols. We further establish that the gap increases when we have more structured sparsity: for block sparsity this gap can be as large as polynomial in the dimensionality. Thus, the more structured the sparsity, the greater the advantage of interaction. Proving the lower bounds requires a careful breaking of a sum of correlated random variables into independent components using Baranyai's theorem on decomposition of hypergraphs, which might be of independent interest.

preprint | 2021 | arXiv

Advances and Open Problems in Federated Learning

Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

preprint | 2021 | arXiv

Estimating Sparse Discrete Distributions Under Local Privacy and Communication Constraints

We consider the problem of estimating sparse discrete distributions under local differential privacy (LDP) and communication constraints. We characterize the sample complexity for sparse estimation under LDP constraints up to a constant factor and the sample complexity under communication constraints up to a logarithmic factor. Our upper bounds under LDP are based on the Hadamard Response, a private-coin scheme that requires only one bit of communication per user. Under communication constraints, we propose public-coin schemes based on random hashing functions. Our tight lower bounds are based on the recently proposed method of chi-squared contractions.

preprint | 2021 | arXiv

Inference under Information Constraints III: Local Privacy Constraints

We study goodness-of-fit and independence testing of discrete distributions in a setting where samples are distributed across multiple users. The users wish to preserve the privacy of their data while enabling a central server to perform the tests. Under the notion of local differential privacy, we propose simple, sample-optimal, and communication-efficient protocols for these two questions in the noninteractive setting, where in addition users may or may not share a common random seed. In particular, we show that the availability of shared (public) randomness greatly reduces the sample complexity. Underlying our public-coin protocols are privacy-preserving mappings which, when applied to the samples, minimally contract the distance between their respective probability distributions.

preprint | 2020 | arXiv

Context-Aware Local Differential Privacy

Local differential privacy (LDP) is a strong notion of privacy for individual users that often comes at the expense of a significant drop in utility. The classical definition of LDP assumes that all elements in the data domain are equally sensitive. However, in many applications, some symbols are more sensitive than others. This work proposes a context-aware framework of local differential privacy that allows a privacy designer to incorporate the application's context into the privacy definition. For binary data domains, we provide a universally optimal privatization scheme and highlight its connections to Warner's randomized response (RR) and Mangat's improved response. Motivated by geolocation and web search applications, for $k$-ary data domains, we consider two special cases of context-aware LDP: block-structured LDP and high-low LDP. We study discrete distribution estimation and provide communication-efficient, sample-optimal schemes and information-theoretic lower bounds for both models. We show that using contextual information can require fewer samples than classical LDP to achieve the same accuracy.
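
For readers unfamiliar with the baseline that the abstract above refers to, here is a minimal sketch of classical Warner-style randomized response for a single private bit. It is not the context-aware scheme proposed in the paper. The function names (`randomized_response`, `estimate_mean`) and the choice of truth probability e^eps / (1 + e^eps), the standard eps-LDP setting, are assumptions made for illustration only.

```python
import math
import random
from typing import Sequence


def truth_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under eps-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability p, otherwise report the flipped bit."""
    p = truth_probability(epsilon)
    return bit if rng.random() < p else 1 - bit


def estimate_mean(reports: Sequence[int], epsilon: float) -> float:
    """Debias the average of privatized reports to estimate the true mean.

    If mu is the true fraction of ones, then
    E[report] = p * mu + (1 - p) * (1 - mu) = (2p - 1) * mu + (1 - p),
    so we invert that affine map.
    """
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population: 30% of users hold the sensitive bit 1.
    true_bits = [1 if i % 10 < 3 else 0 for i in range(200_000)]
    reports = [randomized_response(b, 1.0, rng) for b in true_bits]
    print(estimate_mean(reports, 1.0))
```

Each user's report is individually noisy, yet the debiased average concentrates around the true fraction as the number of users grows. Smaller values of epsilon give stronger privacy at the cost of a larger variance in the estimate.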