Paper detail

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.

preprint2026arXivOpen access

Isaac David Arthur Gervais

Cryptography and Security Artificial Intelligence

Open graph Reviews Discussion

Signal facts

What is known right now

Open access2 authors2 topics

Imported metadata coverageMissing code, dataset, citation and institution fields are tracked without dominating the paper.Details

Citations: 0Reviews: 0Saves: 0Code: not linkedDataset: not linkedInstitutions: 0

Next steps

Decide what to do with this paper

Like0 Dislike0Score 0

Use like or dislike for the fast social read. The more specific scholarly feedback stays available below when needed.

Save to reading list0

Keep the important signals around this paper in one place: votes, save state, collection context, reviews and the metadata you need before deciding what to do next.

Authors

Isaac David Arthur Gervais

Institutions

No institution affiliation has been imported for this paper yet.

Add specific reaction

Move through nearby people, institutions, topics and adjacent work without leaving the paper page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

ContributeLeave structured feedbackUse the review template when you have a concrete strength, concern or method question.Open review form

No structured reviews yet. High-signal critique starts here.

DiscussAdd a high-signal commentKeep quick notes, caveats and replication pointers separate from formal reviews.Open comment form

No discussion yet. The first strong comment sets the tone.

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

What is known right now

Decide what to do with this paper

Keep the important context close to the paper

Authors

Institutions

Research map

Building this map preview

0 review(s)

0 comment(s)