Source author record

Donghong Cai

Donghong Cai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Artificial Intelligence, Computation and Language, Machine Learning, eess.SP, Information Theory, math.IT

Catalog footprint

What is connected

3 works

6 topics

4 close collaborators

Preprint, 2026, arXiv

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
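To make the zero-advantage problem and the prompt-space perturbation in this abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the commonly used group-standardized advantage (reward minus group mean, divided by group standard deviation). The names `group_relative_advantages`, `perturb_prompt` and `LOREM_VOCAB`, the short word list, and the prefix length are all hypothetical.

```python
import random
import statistics

# Hypothetical, abbreviated Lorem Ipsum vocabulary (assumption: the paper's
# actual word list and prefix length are not given in the abstract).
LOREM_VOCAB = [
    "lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
    "adipiscing", "elit", "sed", "do", "eiusmod", "tempor",
]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same query.

    Assumed form: (r - mean) / (std + eps). If every rollout receives the
    same reward (for example, all fail with reward 0), every advantage is 0.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def perturb_prompt(prompt, rng, n_words=8):
    """Prepend a randomly assembled pseudo-Latin prefix to the prompt."""
    prefix = " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))
    return prefix + "\n\n" + prompt


if __name__ == "__main__":
    # All rollouts fail: the group carries no training signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
    # One success restores a non-zero signal for the whole group.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    # A perturbed prompt that could be used when resampling a failed query.
    print(perturb_prompt("What is 2 + 2?", random.Random(0)))
```

In this sketch, a query whose group of rewards is all zero would be resampled with `perturb_prompt(...)` instead of the original prompt. How the resulting rollouts are scored and folded back into the policy update is not covered by the abstract and is left out.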

Preprint, 2026, arXiv

Process Rewards with Learned Reliability

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
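The Beta-Binomial likelihood mentioned in this abstract can be sketched in a few lines of Python. This is an illustrative sketch, not BetaPRM itself. The abstract does not say how the model parameterizes the Beta belief or which statistic it reports as reliability, so treating `alpha` and `beta` as direct outputs and using the Beta variance as an uncertainty proxy are assumptions. All function names are hypothetical.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(k, n, alpha, beta):
    """Negative log-likelihood of k successes in n Monte Carlo continuations
    under a Beta(alpha, beta) belief over the step-success probability.

    P(k | n, alpha, beta) = C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)
    """
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    log_lik = (
        log_choose
        + log_beta_fn(k + alpha, n - k + beta)
        - log_beta_fn(alpha, beta)
    )
    return -log_lik


def beta_mean(alpha, beta):
    """Point estimate of the step-success probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Spread of the belief; smaller values mean a more concentrated belief
    (used here as an assumed stand-in for low reliability when large)."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))


if __name__ == "__main__":
    # Two beliefs with the same mean (0.5) but different concentration.
    for a, b in [(2.0, 2.0), (20.0, 20.0)]:
        print(a, b, beta_mean(a, b), beta_variance(a, b),
              beta_binomial_nll(4, 8, a, b))
```

Unlike regressing to the ratio `k / n`, minimizing this loss lets two steps with the same observed ratio end up with different concentrations, which is the kind of signal a downstream stopping rule could consume.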

Preprint, 2020, arXiv

Intelligent User Clustering and Robust Beamforming Design for UAV-NOMA Downlink

In this work, we consider a downlink NOMA network with multiple single-antenna users and multi-antenna UAVs. In particular, the users are spatially located in several clusters by following the Poisson Cluster Process, and each cluster is served by a hovering UAV with NOMA. For practical considerations, we assume that only imperfect CSI of each user is available at the UAVs. Based on this model, the problem of joint user clustering and robust beamforming design is formulated to minimize the sum transmission power while guaranteeing the QoS requirements of users. Due to the integer variables of user clustering, the coupling effects of beamformers, and the infinitely many constraints caused by the imperfect CSI, the formulated problem is challenging to solve. To reduce computational complexity, the original problem is divided into a user clustering subproblem and a robust beamforming design subproblem. By utilizing the users' position information, we propose a k-means++ based unsupervised clustering algorithm to first deal with the user clustering problem. Then, we focus on the robust beamforming design problem. To gain insight into solving the robust beamforming design problem, we first investigate the problem with perfect CSI and show that it can be solved optimally. Second, for the problem in the general case with imperfect CSI, an SDR-based method is proposed to produce a suboptimal solution efficiently. Moreover, we provide and theoretically analyze a sufficient condition under which the SDR-based approach is guaranteed to obtain an optimal rank-one solution. Finally, an algorithm based on the alternating direction method of multipliers (ADMM) is proposed to allow the UAVs to perform robust beamforming design efficiently in a decentralized fashion. Simulation results demonstrate the efficacy of the proposed algorithms and transmission scheme.
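The position-based clustering step in this abstract can be illustrated with a small, dependency-free Python sketch of k-means++ seeding followed by standard Lloyd iterations. It is a generic sketch rather than the paper's algorithm. Representing user positions as 2-D tuples, the iteration count, and the names `kmeans_pp_seeds` and `cluster_users` are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, rng):
    """k-means++ seeding: each new center is sampled with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing centers; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def cluster_users(points, k, rng, iters=20):
    """Seed with k-means++, then alternate assignment and centroid updates."""
    centers = kmeans_pp_seeds(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign(points, centers)


if __name__ == "__main__":
    # Hypothetical user positions forming two well-separated groups.
    users = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    centers, labels = cluster_users(users, k=2, rng=random.Random(0))
    print(centers, labels)
```

In the setting described above, each resulting cluster would be served by one UAV. The beamforming subproblem (SDR and ADMM) is a separate optimization step and is not sketched here.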

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint