Researcher profile

Yipeng Zhou

Yipeng Zhou contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that addresses this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optima. Experimental results across challenging math reasoning and question-answering tasks demonstrates that LambdaPO improves performance compared to the baseline methods.

preprint2022arXiv

A Fast Blockchain-based Federated Learning Framework with Compressed Communications

Recently, blockchain-based federated learning (BFL) has attracted intensive research attention due to that the training process is auditable and the architecture is serverless avoiding the single point failure of the parameter server in vanilla federated learning (VFL). Nevertheless, BFL tremendously escalates the communication traffic volume because all local model updates (i.e., changes of model parameters) obtained by BFL clients will be transmitted to all miners for verification and to all clients for aggregation. In contrast, the parameter server and clients in VFL only retain aggregated model updates. Consequently, the huge communication traffic in BFL will inevitably impair the training efficiency and hinder the deployment of BFL in reality. To improve the practicality of BFL, we are among the first to propose a fast blockchain-based communication-efficient federated learning framework by compressing communications in BFL, called BCFL. Meanwhile, we derive the convergence rate of BCFL with non-convex loss. To maximize the final model accuracy, we further formulate the problem to minimize the training loss of the convergence rate subject to a limited training time with respect to the compression rate and the block generation rate, which is a bi-convex optimization problem and can be efficiently solved. To the end, to demonstrate the efficiency of BCFL, we carry out extensive experiments with standard CIFAR-10 and FEMNIST datasets. Our experimental results not only verify the correctness of our analysis, but also manifest that BCFL can remarkably reduce the communication traffic by 95-98% or shorten the training time by 90-95% compared with BFL.

preprint2022arXiv

Deep Hierarchy Quantization Compression algorithm based on Dynamic Sampling

Unlike traditional distributed machine learning, federated learning stores data locally for training and then aggregates the models on the server, which solves the data security problem that may arise in traditional distributed machine learning. However, during the training process, the transmission of model parameters can impose a significant load on the network bandwidth. It has been pointed out that the vast majority of model parameters are redundant during model parameter transmission. In this paper, we explore the data distribution law of selected partial model parameters on this basis, and propose a deep hierarchical quantization compression algorithm, which further compresses the model and reduces the network load brought by data transmission through the hierarchical quantization of model parameters. And we adopt a dynamic sampling strategy for the selection of clients to accelerate the convergence of the model. Experimental results on different public datasets demonstrate the effectiveness of our algorithm.

preprint2021arXiv

Gain without Pain: Offsetting DP-injected Nosies Stealthily in Cross-device Federated Learning

Federated Learning (FL) is an emerging paradigm through which decentralized devices can collaboratively train a common model. However, a serious concern is the leakage of privacy from exchanged gradient information between clients and the parameter server (PS) in FL. To protect gradient information, clients can adopt differential privacy (DP) to add additional noises and distort original gradients before they are uploaded to the PS. Nevertheless, the model accuracy will be significantly impaired by DP noises, making DP impracticable in real systems. In this work, we propose a novel Noise Information Secretly Sharing (NISS) algorithm to alleviate the disturbance of DP noises by sharing negated noises among clients. We theoretically prove that: 1) If clients are trustworthy, DP noises can be perfectly offset on the PS; 2) Clients can easily distort negated DP noises to protect themselves in case that other clients are not totally trustworthy, though the cost lowers model accuracy. NISS is particularly applicable for FL across multiple IoT (Internet of Things) systems, in which all IoT devices need to collaboratively train a model. To verify the effectiveness and the superiority of the NISS algorithm, we conduct experiments with the MNIST and CIFAR-10 datasets. The experiment results verify our analysis and demonstrate that NISS can improve model accuracy by 21% on average and obtain better privacy protection if clients are trustworthy.

preprint2021arXiv

On the Practicality of Differential Privacy in Federated Learning by Tuning Iteration Times

In spite that Federated Learning (FL) is well known for its privacy protection when training machine learning models among distributed clients collaboratively, recent studies have pointed out that the naive FL is susceptible to gradient leakage attacks. In the meanwhile, Differential Privacy (DP) emerges as a promising countermeasure to defend against gradient leakage attacks. However, the adoption of DP by clients in FL may significantly jeopardize the model accuracy. It is still an open problem to understand the practicality of DP from a theoretic perspective. In this paper, we make the first attempt to understand the practicality of DP in FL through tuning the number of conducted iterations. Based on the FedAvg algorithm, we formally derive the convergence rate with DP noises in FL. Then, we theoretically derive: 1) the conditions for the DP based FedAvg to converge as the number of global iterations (GI) approaches infinity; 2) the method to set the number of local iterations (LI) to minimize the negative influence of DP noises. By further substituting the Laplace and Gaussian mechanisms into the derived convergence rate respectively, we show that: 3) The DP based FedAvg with the Laplace mechanism cannot converge, but the divergence rate can be effectively prohibited by setting the number of LIs with our method; 4) The learning error of the DP based FedAvg with the Gaussian mechanism can converge to a constant number finally if we use a fixed number of LIs per GI. To verify our theoretical findings, we conduct extensive experiments using two real-world datasets. The results not only validate our analysis results, but also provide useful guidelines on how to optimize model accuracy when incorporating DP into FL

preprint2021arXiv

Optimizing Video Caching at the Edge: A Hybrid Multi-Point Process Approach

It is always a challenging problem to deliver a huge volume of videos over the Internet. To meet the high bandwidth and stringent playback demand, one feasible solution is to cache video contents on edge servers based on predicted video popularity. Traditional caching algorithms (e.g., LRU, LFU) are too simple to capture the dynamics of video popularity, especially long-tailed videos. Recent learning-driven caching algorithms (e.g., DeepCache) show promising performance, however, such black-box approaches are lack of explainability and interpretability. Moreover, the parameter tuning requires a large number of historical records, which are difficult to obtain for videos with low popularity. In this paper, we optimize video caching at the edge using a white-box approach, which is highly efficient and also completely explainable. To accurately capture the evolution of video popularity, we develop a mathematical model called \emph{HRS} model, which is the combination of multiple point processes, including Hawkes' self-exciting, reactive and self-correcting processes. The key advantage of the HRS model is its explainability, and much less number of model parameters. In addition, all its model parameters can be learned automatically through maximizing the Log-likelihood function constructed by past video request events. Next, we further design an online HRS-based video caching algorithm. To verify its effectiveness, we conduct a series of experiments using real video traces collected from Tencent Video, one of the largest online video providers in China. Experiment results demonstrate that our proposed algorithm outperforms the state-of-the-art algorithms, with 12.3\% improvement on average in terms of cache hit rate under realistic settings.