Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
- URL: http://arxiv.org/abs/2512.13655v1
- Date: Mon, 15 Dec 2025 18:48:42 GMT
- Title: Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
- Authors: Richard J. Young
- Abstract summary: This study evaluates four abliteration tools across sixteen instruction-tuned models. Single-pass methods demonstrated superior capability preservation on the benchmarked subset. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
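To make the evaluated technique concrete, the sketch below illustrates directional orthogonalization as the abstract describes it: a refusal direction is estimated from activations on harmful versus harmless prompts, projected out of a weight matrix, and the resulting distribution shift is scored with a per-token KL divergence like the one the study reports. This is a minimal sketch under assumed shapes and a difference-of-means direction estimator; the function names are hypothetical and do not reflect the actual APIs of Heretic, DECCP, ErisForge, or FailSpy.
```python
# Minimal sketch of abliteration via directional orthogonalization.
# Assumptions (illustrative, not from the paper's tools): activations are
# residual-stream vectors of width h, and the refusal direction is taken
# as a difference of means between harmful and harmless prompt activations.
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit vector along mean(harmful) - mean(harmless) activations."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return d / d.norm()

def orthogonalize_weight(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove W's ability to write along d: W' = (I - d d^T) W."""
    d = d / d.norm()
    return W - torch.outer(d, d) @ W

def kl_shift(logits_orig: torch.Tensor, logits_abl: torch.Tensor) -> torch.Tensor:
    """Per-token KL(original || abliterated), the kind of distribution-shift
    metric quoted in the abstract (0.043-1.646 for the Bayesian-optimized tool)."""
    p = torch.log_softmax(logits_orig, dim=-1)
    q = torch.log_softmax(logits_abl, dim=-1)
    return torch.sum(p.exp() * (p - q), dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    h, vocab = 64, 100
    d = estimate_refusal_direction(torch.randn(32, h), torch.randn(32, h))
    W_abl = orthogonalize_weight(torch.randn(h, h), d)
    print((d @ W_abl).abs().max())   # ~0: the direction is projected out
    print(kl_shift(torch.randn(4, vocab), torch.randn(4, vocab)).mean())
```
In practice the surveyed tools differ mainly in how the direction and intervention sites are chosen: the single-pass methods apply one estimated direction, while the Bayesian-optimized approach searches over ablation parameters per model, consistent with the wider KL range reported above. As a sanity check on the headline numbers, a -18.81 pp GSM8K drop at -26.5% relative implies a baseline accuracy near 71 percent.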
Related papers
- Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models [0.0]
We present a controlled study of multi-hop contextual reasoning in large language models. We show that multi-agent systems exhibit the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail.
arXiv Detail & Related papers (2026-01-06T20:18:55Z) - Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains [0.0]
We present the first systematic study of cross-LLM behavioral backdoor detection. We show that single-model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs. We show that model-aware detection, which uses model identity as an additional feature, achieves 90.6% accuracy across all evaluated models.
arXiv Detail & Related papers (2025-11-25T03:33:04Z) - Beyond Mimicry: Preference Coherence in LLMs [0.19116784879310025]
We investigate whether large language models exhibit genuine preference structures by testing their responses to AI-specific trade-offs. We find that 23 combinations (47.9%) demonstrate statistically significant relationships between scenario intensity and choice patterns. Only 5 combinations (10.4%) demonstrate meaningful preference coherence through adaptive or threshold-based behavior. The prevalence of unstable transitions (45.8%) and stimulus-specific sensitivities suggests current AI systems lack unified preference structures.
arXiv Detail & Related papers (2025-11-17T17:41:48Z) - An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR [0.0]
GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures. Each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific.
arXiv Detail & Related papers (2025-11-14T22:50:22Z) - MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling [115.74855199827596]
MiroThinker is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level.
arXiv Detail & Related papers (2025-11-14T18:52:07Z) - Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z) - Causal Intervention Framework for Variational Auto Encoder Mechanistic Interpretability [0.0]
This paper introduces a comprehensive causal intervention framework for mechanistic interpretability of Variational Autoencoders (VAEs). We develop techniques to identify and analyze "circuit motifs" in VAEs, examining how semantic factors are encoded, processed, and disentangled through the network layers. Results show that our interventions can successfully isolate functional circuits, map computational graphs to causal graphs of semantic factors, and distinguish between polysemantic and monosemantic units.
arXiv Detail & Related papers (2025-05-06T13:40:59Z) - SASWISE-UE: Segmentation and Synthesis with Interpretable Scalable Ensembles for Uncertainty Estimation [6.082812294410541]
This paper introduces an efficient sub-model ensemble framework aimed at enhancing the interpretability of medical deep learning models.
By generating uncertainty maps, this framework enables end-users to evaluate the reliability of model outputs.
arXiv Detail & Related papers (2024-11-08T04:37:55Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
arXiv Detail & Related papers (2023-02-23T00:43:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.