Source author record

John M. Cioffi

John M. Cioffi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Machine Learning, Artificial Intelligence, Computation and Language, eess.SY, Information Theory, math.IT, Systems and Control

Catalog footprint

What is connected

4 works

7 topics

4 close collaborators

preprint, 2026, arXiv

Epistemic Uncertainty for Test-Time Discovery

Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where the adapters diverge because of insufficient coverage rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks and maintains substantially higher solution diversity. An ablation study confirms that the regularizer is essential for sustaining this behavior.
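The disagreement measure described above can be illustrated with a minimal sketch. It assumes the standard ensemble decomposition, in which the mutual information at one token position equals the entropy of the averaged prediction minus the average entropy of the individual members. The function names, the `beta` coefficient, and the additive form of the bonus are hypothetical choices for illustration and are not taken from the UG-TTT paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def epistemic_mi(member_probs):
    """Mutual information between the next token and the ensemble member.

    member_probs holds one next-token distribution per adapter for a single
    token position. The result is H(mean_k p_k) - mean_k H(p_k): it is zero
    when all members agree and grows as their predictions diverge.
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    mean_p = [sum(p[v] for p in member_probs) / k for v in range(vocab)]
    total = entropy(mean_p)                                # total uncertainty
    expected = sum(entropy(p) for p in member_probs) / k   # intrinsic part
    return total - expected


def bonus_advantages(advantages, mi_per_token, beta=0.1):
    """Add a scaled disagreement bonus to per-token advantages.

    beta is a hypothetical exploration coefficient (an assumption).
    """
    return [a + beta * m for a, m in zip(advantages, mi_per_token)]


# Two adapters that agree on a genuinely ambiguous token: no epistemic signal.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Two confident adapters that contradict each other: maximal signal, ln 2.
disagree = [[1.0, 0.0], [0.0, 1.0]]
print(epistemic_mi(agree), epistemic_mi(disagree))
```

The two toy cases show why the measure separates the two kinds of uncertainty: an ambiguous token on which every adapter agrees scores zero, while confident but contradictory adapters score the maximum for a two-token vocabulary.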

preprint, 2026, arXiv

General Preference Reinforcement Learning

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
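The per-dimension normalization step can be sketched as follows. This is an illustration under stated assumptions, not the GPRL implementation: it assumes each response in a sampled group already has one scalar score per preference dimension (the abstract does not say how these are derived from the GPM), and it uses a fixed `weights` list as a stand-in for the context-dependent eigenvalues. All names are hypothetical.

```python
import math


def per_dimension_advantages(scores, weights, eps=1e-8):
    """Group-relative advantages, normalized separately on each dimension.

    scores[i][d] is the score of response i on dimension d (assumed given).
    weights[d] is a nonnegative aggregation weight, standing in for the
    context-dependent eigenvalues mentioned in the abstract.
    """
    n, k = len(scores), len(scores[0])
    normalized = [[0.0] * k for _ in range(n)]
    for d in range(k):
        column = [scores[i][d] for i in range(n)]
        mean = sum(column) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in column) / n)
        for i in range(n):
            # Each dimension is centered and scaled by its own group statistics.
            normalized[i][d] = (column[i] - mean) / (std + eps)
    total_weight = sum(weights)
    return [
        sum(weights[d] * normalized[i][d] for d in range(k)) / total_weight
        for i in range(n)
    ]


# Dimension 1 has a raw range 100 times larger than dimension 0.
group_scores = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
advantages = per_dimension_advantages(group_scores, weights=[1.0, 1.0])
print(advantages)
```

In the toy group, response 1 wins on the large-scale dimension and response 2 wins on the small-scale one. After per-dimension normalization the two receive nearly equal advantages, which is the sense in which no axis can dominate merely because of its scale.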

preprint, 2022, arXiv

Spatial Reuse in Dense Wireless Areas: A Cross-layer Optimization Approach via ADMM

This paper introduces an efficient method for communication resource use in dense wireless areas where all nodes must communicate with a common destination node. The proposed method groups nodes based on their distance from the destination and creates a structured multi-hop configuration in which each group can relay its neighbor's data. The large number of active radio nodes and the common direction of communication toward a single destination are exploited to reuse the limited spectrum resources in spatially separated groups. Spectrum allocation constraints among groups are then embedded in a joint routing and resource allocation framework to optimize the route and amount of resources allocated to each node. The solution to this problem uses coordination among the lower layers of the wireless-network protocol stack to outperform conventional approaches where these layers are decoupled. Furthermore, the structure of this problem is exploited to obtain a semi-distributed optimization algorithm based on the alternating direction method of multipliers (ADMM) where each node can optimize its resources independently based on local channel information.
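To show what a semi-distributed ADMM iteration looks like, here is a toy consensus example. It does not model the paper's joint routing and resource allocation problem. It assumes each node holds a private quadratic cost 0.5 * (x - a_i)^2 and that all nodes must agree on one shared value; the function name and the parameters `rho` and `iters` are hypothetical.

```python
def consensus_admm(local_targets, rho=1.0, iters=200):
    """Global-consensus ADMM for min over x of sum_i 0.5 * (x - a_i)^2.

    Each node i keeps a local copy x_i and a scaled dual variable u_i.
    Only the averaging step needs information from every node.
    """
    n = len(local_targets)
    x = [0.0] * n
    u = [0.0] * n
    z = 0.0
    for _ in range(iters):
        # Local step: node i minimizes its own cost plus the penalty
        # (rho / 2) * (x_i - z + u_i)^2, using only a_i, z, and u_i.
        x = [(a + rho * (z - ui)) / (1.0 + rho)
             for a, ui in zip(local_targets, u)]
        # Coordination step: average the local proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: each node accumulates its disagreement with z.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x


z, x = consensus_admm([1.0, 2.0, 6.0])
print(z, x)
```

For this cost the agreed value converges to the mean of the local targets. The local and dual steps run independently at each node, and only the averaging step requires coordination, which is the general pattern behind calling such a scheme semi-distributed.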

preprint, 2010, arXiv

On the Generality of $1+\mathbf{i}$ as a Non-Norm Element

Full-rate space-time block codes with nonvanishing determinants have been extensively designed with cyclic division algebras. For these designs, smaller pairwise error probabilities of maximum likelihood detections require larger normalized diversity products, which can be obtained by choosing integer non-norm elements with smaller absolute values. All known methods have constructed $1+\mathbf{i}$ and $2+\mathbf{i}$ to be integer non-norm elements with the smallest absolute values over QAM for the number of transmit antennas $n$: $\{n:5\leq n\leq 40,8\nmid n\}$ and $\{n:5\leq n\leq 40,8\mid n\}$, respectively. Via explicit constructions, this paper proves that $1+\mathbf{i}$ is an integer non-norm element with the smallest absolute value over QAM for every $n\geq 5$.
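As a short worked check of the absolute values involved, assume that the integer elements over QAM are the Gaussian integers $a+b\mathbf{i}$ with $a,b\in\mathbb{Z}$ (an assumption; the abstract does not spell this out). Then

$$|a+b\mathbf{i}|^2 = a^2+b^2, \qquad |1+\mathbf{i}| = \sqrt{2}, \qquad |2+\mathbf{i}| = \sqrt{5}.$$

The only nonzero Gaussian integers with $a^2+b^2=1$ are the units $\pm 1,\pm\mathbf{i}$, and the next attainable value is $a^2+b^2=2$. So $\sqrt{2}$ is the smallest absolute value available to a non-unit element, and replacing $2+\mathbf{i}$ with $1+\mathbf{i}$ lowers the absolute value from $\sqrt{5}$ to $\sqrt{2}$.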