Paper detail

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.

preprint2026arXivOpen access

Francesco Sovrano Gabriele Dominici Marc Langheinrich

Machine Learning Artificial Intelligence

Open graph Reviews Discussion

Signal facts

What is known right now

Open access3 authors2 topics

Imported metadata coverageMissing code, dataset, citation and institution fields are tracked without dominating the paper.Details

Citations: 0Reviews: 0Saves: 0Code: not linkedDataset: not linkedInstitutions: 0

Next steps

Decide what to do with this paper

Like0 Dislike0Score 0

Use like or dislike for the fast social read. The more specific scholarly feedback stays available below when needed.

Save to reading list0

Keep the important signals around this paper in one place: votes, save state, collection context, reviews and the metadata you need before deciding what to do next.

Authors

Francesco Sovrano Gabriele Dominici Marc Langheinrich

Institutions

No institution affiliation has been imported for this paper yet.

Add specific reaction

Move through nearby people, institutions, topics and adjacent work without leaving the paper page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

ContributeLeave structured feedbackUse the review template when you have a concrete strength, concern or method question.Open review form

No structured reviews yet. High-signal critique starts here.

DiscussAdd a high-signal commentKeep quick notes, caveats and replication pointers separate from formal reviews.Open comment form

No discussion yet. The first strong comment sets the tone.

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

What is known right now

Decide what to do with this paper

Keep the important context close to the paper

Authors

Institutions

Research map

Building this map preview

0 review(s)

0 comment(s)