Researcher profile

John M. Cioffi

John M. Cioffi contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 17 - Unverified
Verification L1
Unclaimed author
4 works
0 followers
7 topics
4 close collaborators

Published work

4 published item(s)

preprint, 2026, arXiv

Epistemic Uncertainty for Test-Time Discovery

Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks and maintains substantially higher solution diversity. An ablation study confirms that the regularizer is essential for sustaining this behavior.
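
The per-token disagreement described in this abstract can be illustrated with the standard ensemble decomposition: the mutual information equals the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch below is a minimal illustration and not the authors' implementation; the function names, the input layout (one probability vector per adapter), and the uniform weighting over adapters are assumptions made for the example.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def ensemble_mutual_information(member_probs):
    """Mutual information between the predicted token and the ensemble member.

    member_probs: list of K next-token distributions over the same vocabulary,
    one per adapter (hypothetical layout). Members are weighted uniformly,
    which is an assumption of this sketch.
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    # Averaged (ensemble) prediction for each vocabulary entry.
    mean = [sum(p[v] for p in member_probs) / k for v in range(vocab)]
    total = entropy(mean)                                  # total uncertainty
    expected = sum(entropy(p) for p in member_probs) / k   # average member entropy
    return total - expected                                # disagreement term


# Members that agree carry no epistemic signal, even when each is uncertain.
agree = ensemble_mutual_information([[0.5, 0.5], [0.5, 0.5]])
# Members that are individually confident but contradict each other do.
disagree = ensemble_mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

In this toy case, two members that are equally unsure give zero mutual information, while two confident but contradictory members give the maximum for a two-token vocabulary. This is the distinction the abstract draws between intrinsic difficulty and insufficient coverage.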

preprint, 2026, arXiv

General Preference Reinforcement Learning

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
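
The idea of normalizing each dimension on its own scale before aggregating can be sketched as follows. This is an illustrative reading of the abstract, not GPRL itself: the function name, the score layout, and the fixed `weights` argument (standing in for the context-dependent eigenvalues) are all assumptions.

```python
import statistics


def per_dimension_advantages(scores, weights, eps=1e-8):
    """Combine group-relative advantages computed separately per dimension.

    scores[i][d]: score of response i on dimension d (hypothetical layout).
    weights[d]:   aggregation weight for dimension d (stand-in for the
                  context-dependent eigenvalues mentioned in the abstract).
    """
    n = len(scores)
    combined = [0.0] * n
    for d, w in enumerate(weights):
        column = [scores[i][d] for i in range(n)]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for i in range(n):
            # Standardize within the group so a large-scale axis cannot dominate.
            combined[i] += w * (column[i] - mean) / (std + eps)
    return combined


# Dimension 1 is on a 100x larger scale but ranks the responses in reverse.
group = [[1.0, 300.0], [2.0, 200.0], [3.0, 100.0]]
advantages = per_dimension_advantages(group, weights=[0.5, 0.5])
```

With equal weights, the two opposing dimensions cancel almost exactly despite the 100x difference in raw scale. A plain sum of raw scores would instead be driven entirely by the second dimension.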

preprint, 2022, arXiv

Spatial Reuse in Dense Wireless Areas: A Cross-layer Optimization Approach via ADMM

This paper introduces an efficient method for communication resource use in dense wireless areas where all nodes must communicate with a common destination node. The proposed method groups nodes based on their distance from the destination and creates a structured multi-hop configuration in which each group can relay its neighbor's data. The large number of active radio nodes and the common direction of communication toward a single destination are exploited to reuse the limited spectrum resources in spatially separated groups. Spectrum allocation constraints among groups are then embedded in a joint routing and resource allocation framework to optimize the route and amount of resources allocated to each node. The solution to this problem uses coordination among the lower layers of the wireless-network protocol stack to outperform conventional approaches where these layers are decoupled. Furthermore, the structure of this problem is exploited to obtain a semi-distributed optimization algorithm based on the alternating direction method of multipliers (ADMM) where each node can optimize its resources independently based on local channel information.
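
The semi-distributed pattern that ADMM enables, with local updates at each node and a light coordination step, can be seen in a toy consensus problem. The sketch below is not the paper's routing and resource allocation formulation: it minimizes a sum of hypothetical local quadratic costs, and the function name, the `local_targets` input, and the choice of `rho` are assumptions for illustration.

```python
def admm_consensus(local_targets, rho=1.0, iterations=100):
    """Scaled-form consensus ADMM for min sum_i 0.5 * (x_i - a_i)^2 s.t. x_i = z.

    local_targets: the a_i values, each known only to its own node
    (a hypothetical stand-in for local channel information).
    """
    n = len(local_targets)
    x = [0.0] * n   # local variables, one per node
    u = [0.0] * n   # scaled dual variables, one per node
    z = 0.0         # shared consensus variable
    for _ in range(iterations):
        # Local step: each node solves its own small problem in closed form.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(local_targets, u)]
        # Coordination step: average the local proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: each node accumulates its disagreement with the consensus.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z, x


consensus, local_values = admm_consensus([1.0, 2.0, 6.0])
```

For this quadratic cost the consensus value converges to the mean of the local targets (3.0 here), and every local variable converges to the same value. Only the averaging step needs information from more than one node.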

preprint, 2010, arXiv

On the Generality of $1+\mathbf{i}$ as a Non-Norm Element

Full-rate space-time block codes with nonvanishing determinants have been extensively designed with cyclic division algebras. For these designs, smaller pairwise error probabilities of maximum likelihood detections require larger normalized diversity products, which can be obtained by choosing integer non-norm elements with smaller absolute values. All known methods have constructed $1+\mathbf{i}$ and $2+\mathbf{i}$ to be integer non-norm elements with the smallest absolute values over QAM for the number of transmit antennas $n$: $\{n:5\leq n\leq 40,8\nmid n\}$ and $\{n:5\leq n\leq 40,8\mid n\}$, respectively. Via explicit constructions, this paper proves that $1+\mathbf{i}$ is an integer non-norm element with the smallest absolute value over QAM for every $n\geq 5$.