Paper detail

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a \emph{harness} system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76\% of cases. Our ReFlect harness achieves task success rates ranging from 41\% on gpt-4o-mini to 56\% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5, and additionally raises SWE-bench patch-structural quality from 0\% (Direct CoT) to between 82\% (Qwen2.5-72B) and 87\% (GPT-4o). Notably, the harness gain is inversely proportional to the model's Direct CoT task success rate (the fitted slope is -1.69 with r=-0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We spot that adding structured reasoning state and operators yields only 15.0--18.7\% pair-mean on Llama-3.3-70B and Qwen2.5-72B because models at this scale cannot reliably populate the state its operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.

preprint2026arXivOpen access
0citations
0reviews
0saves
Nocode
Nodataset
0institutions

Next steps

Decide what to do with this paper

Use like or dislike for the fast social read. The more specific scholarly feedback stays available below when needed.

Log in to curate

Reading frame

Keep the important context close to the paper

Keep the important signals around this paper in one place: votes, save state, collection context, reviews and the metadata you need before deciding what to do next.

Authors

Institutions

Add specific reaction

Move through the context

Research map

Open full explorer

Move through nearby people, institutions, topics and adjacent work without leaving the paper page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Structured reviews

0 review(s)

ContributeLeave structured feedbackUse the review template when you have a concrete strength, concern or method question.Open review form

No structured reviews yet. High-signal critique starts here.

Work discussion

0 comment(s)

DiscussAdd a high-signal commentKeep quick notes, caveats and replication pointers separate from formal reviews.Open comment form

No discussion yet. The first strong comment sets the tone.