Researcher profile

Donghong Cai

Donghong Cai contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 15 - Unverified
Verification L1
Unclaimed author
3 works
0 followers
6 topics
4 close collaborators

Published work

3 published items

preprint | 2026 | arXiv

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
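The two mechanisms named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean/standard-deviation normalization is the commonly described GRPO form, and the vocabulary sample, prefix length, separator, and the helper names `group_relative_advantages`, `lorem_prefix`, and `perturb_prompt` are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import random
import statistics

# Hypothetical vocabulary: a small sample of Lorem Ipsum words (illustrative only).
LOREM_VOCAB = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
               "adipiscing", "elit", "sed", "do", "eiusmod", "tempor"]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def lorem_prefix(rng, n_words=8):
    """Assemble a random word sequence from the vocabulary (assumed scheme)."""
    return " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))


def perturb_prompt(prompt, rng, n_words=8):
    """Prepend a task-irrelevant prefix to the prompt before resampling."""
    return f"{lorem_prefix(rng, n_words)}\n\n{prompt}"


rng = random.Random(0)

# All rollouts fail: every advantage is zero, so the group gives no training signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the advantages become non-zero again.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

perturbed = perturb_prompt("Solve: 17 * 23 = ?", rng)
```

In this sketch the perturbed prompt would be used only to draw new rollouts for queries whose group had zero advantage; how the resulting samples enter the policy update is not described in the abstract and is left out here.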

preprint | 2026 | arXiv

Process Rewards with Learned Reliability

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
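To make the Beta-Binomial likelihood mentioned in this abstract concrete, the sketch below computes its negative log-likelihood for k successful continuations out of n under a Beta(alpha, beta) belief. The formula is the standard Beta-Binomial one; how BetaPRM parameterizes the Beta distribution and how it defines reliability are not stated in the abstract, so the function names and the use of the Beta variance as an uncertainty proxy are assumptions.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k, n, alpha, beta):
    """Negative log-likelihood of k successes in n trials under Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def beta_mean(alpha, beta):
    """Expected success probability under the Beta belief."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of the Beta belief (assumed here as an uncertainty proxy)."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))


# Two beliefs with the same mean (0.75) but different concentration.
confident = (30.0, 10.0)
uncertain = (3.0, 1.0)

# Total probability over all outcomes k = 0..8 for a Beta(2, 3) belief.
total_prob = sum(math.exp(-beta_binomial_nll(k, 8, 2.0, 3.0)) for k in range(9))
```

Both example beliefs predict the same success probability, but the more concentrated one has a smaller variance; a point-target regression on the success ratio k/n would not distinguish them.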

preprint | 2020 | arXiv

Intelligent User Clustering and Robust Beamforming Design for UAV-NOMA Downlink

In this work, we consider a downlink NOMA network with multiple single-antenna users and multi-antenna UAVs. In particular, the users are spatially located in several clusters following the Poisson Cluster Process, and each cluster is served by a hovering UAV with NOMA. For practical considerations, we assume that only imperfect CSI of each user is available at the UAVs. Based on this model, the problem of joint user clustering and robust beamforming design is formulated to minimize the sum transmission power while guaranteeing the users' QoS requirements. Due to the integer variables of user clustering, the coupling effects of the beamformers, and the infinitely many constraints caused by the imperfect CSI, the formulated problem is challenging to solve. To reduce computational complexity, the original problem is divided into a user clustering subproblem and a robust beamforming design subproblem. By utilizing the users' position information, we propose a k-means++ based unsupervised clustering algorithm to first deal with the user clustering problem. Then, we focus on the robust beamforming design problem. To gain insight into solving the robust beamforming design problem, we first investigate the problem with perfect CSI and show that it can be solved optimally. Second, for the general case with imperfect CSI, an SDR-based method is proposed to produce a suboptimal solution efficiently. Moreover, we provide and theoretically analyze a sufficient condition under which the SDR-based approach is guaranteed to obtain an optimal rank-one solution. Finally, an alternating direction method of multipliers based algorithm is proposed to allow the UAVs to perform robust beamforming design efficiently in a decentralized fashion. Simulation results demonstrate the efficacy of the proposed algorithms and transmission scheme.
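The user clustering step named in this abstract can be sketched with generic k-means++ seeding followed by Lloyd iterations on 2D user positions. This is a textbook version, not the paper's algorithm: the coordinates, the number of clusters, the iteration count, and the function names are hypothetical, and no NOMA or beamforming constraints are modeled.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new center with probability proportional
    to its squared distance from the nearest center chosen so far.
    Assumes the input contains at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, k, rng, iters=20):
    """Lloyd iterations started from k-means++ seeds."""
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assign(points, centers)


# Hypothetical user positions: two well-separated groups.
users = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
centers, labels = kmeans(users, 2, random.Random(0))
```

In the setting described above, each resulting cluster would be served by one hovering UAV; the robust beamforming design is then a separate subproblem.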