Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
- URL: http://arxiv.org/abs/2509.25220v1
- Date: Tue, 23 Sep 2025 23:16:11 GMT
- Title: Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
- Authors: Eduard Kapelko
- Abstract summary: A central question is whether undesirable behaviors like deception are localized functions that can be removed. By combining sparse autoencoders, targeted ablation, and adversarial training, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety and controllability are critical for large language models. A central question is whether undesirable behaviors like deception are localized functions that can be removed, or if they are deeply intertwined with a model's core cognitive abilities. We introduce "cyclic ablation," an iterative method to test this. By combining sparse autoencoders, targeted ablation, and adversarial training on DistilGPT-2, we attempted to eliminate the concept of deception. We found that, contrary to the localization hypothesis, deception was highly resilient. The model consistently recovered its deceptive behavior after each ablation cycle via adversarial training, a process we term functional regeneration. Crucially, every attempt at this "neurosurgery" caused a gradual but measurable decay in general linguistic performance, reflected by a consistent rise in perplexity. These findings are consistent with the view that complex concepts are distributed and entangled, underscoring the limitations of direct model editing through mechanistic interpretability.
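The cycle the abstract describes (identify concept features with a sparse autoencoder, ablate them, adversarially retrain, re-probe) can be sketched in a few lines. The snippet below is a toy illustration under stated assumptions, not the paper's implementation: the linear ReLU autoencoder, the dimensions, and the `concept_features` indices are all hypothetical stand-ins for components the authors obtained from DistilGPT-2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a tiny sparse autoencoder over a 16-d activation
# space, and latent indices assumed to have been identified as encoding
# the target concept (e.g. deception) by a prior probing step.
D_MODEL, D_LATENT = 16, 64
W_enc = rng.normal(size=(D_MODEL, D_LATENT))
W_dec = rng.normal(size=(D_LATENT, D_MODEL))
concept_features = [3, 17, 42]

def ablate(acts, feature_ids):
    """Encode activations, zero the chosen SAE features, decode back."""
    z = np.maximum(acts @ W_enc, 0.0)  # ReLU sparse code
    z[:, feature_ids] = 0.0            # targeted ablation of concept features
    return z @ W_dec                   # reconstructed (patched) activations

acts = rng.normal(size=(8, D_MODEL))   # a batch of residual-stream activations
patched = ablate(acts, concept_features)

# In the full method, each cycle would follow this step with adversarial
# fine-tuning and re-probing; the paper's finding is that the concept
# reappears after retraining ("functional regeneration") while perplexity
# drifts upward across cycles.
print(patched.shape)
```

Note that the ablation operates in the SAE's latent basis, not directly on neurons: the decoded activations still differ from the originals by the autoencoder's reconstruction error, which is one mechanism by which such edits can degrade general performance.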
Related papers
- ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis [35.12196884025294]
We introduce ReBeCA (self-Reflection Behavior explained through Causal Analysis), a framework that unveils the interpretable behavioral hierarchy governing self-reflection outcomes. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance.
arXiv Detail & Related papers (2026-02-06T04:00:57Z) - Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning -- Towards a Pure Neural Logic Core [4.073707521515039]
"Digital metabolism" is a hypothesis suggesting that targeted forgetting is necessary for distilling a pure neural logic core. We introduce the Regenerative Logic-Core Protocol (RLCP), a dual-stream training framework that renders specific factual dependencies linearly undecodable. Empirical analysis on GSM8K reveals that the "metabolized" model spontaneously adopts symbolic chain-of-thought scaffolding.
arXiv Detail & Related papers (2026-01-15T19:21:16Z) - Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models [4.946483489399819]
Large Language Models (LLMs) are prone to hallucination, the generation of factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions.
arXiv Detail & Related papers (2025-10-07T16:40:31Z) - Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis [3.1526281887627587]
Distinguishing recall from reasoning is crucial for predicting model generalization. We use controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models.
arXiv Detail & Related papers (2025-10-03T04:13:06Z) - Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models [15.797612515648412]
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. Recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. We introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning.
arXiv Detail & Related papers (2025-09-29T01:13:33Z) - Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI [0.0]
Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging. AAS is proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions.
arXiv Detail & Related papers (2025-09-24T02:18:27Z) - BURN: Backdoor Unlearning via Adversarial Boundary Analysis [73.14147934175604]
Backdoor unlearning aims to remove backdoor-related information while preserving the model's original functionality. We propose Backdoor Unlearning via adversaRial bouNdary analysis (BURN), a novel defense framework that integrates false correlation decoupling, progressive data refinement, and model purification.
arXiv Detail & Related papers (2025-07-14T17:13:06Z) - Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth, and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z) - Concept-Guided Interpretability via Neural Chunking [64.6429903327095]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract recurring chunks at the neural population level. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z) - Gumbel Counterfactual Generation From Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions. We propose a framework for generating true string counterfactuals. We show that the approach produces meaningful counterfactuals while also showing that commonly used intervention techniques have considerable undesired side effects.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - Inverse decision-making using neural amortized Bayesian actors [19.128377007314317]
We amortize the Bayesian actor using a neural network trained on a wide range of parameter settings in an unsupervised fashion. We show how our method allows for principled model comparison and how it can be used to disentangle factors that may lead to unidentifiabilities between priors and costs.
arXiv Detail & Related papers (2024-09-04T10:31:35Z) - Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering [0.0]
We describe different lenses through which to view neuron activations, and investigate the effectiveness of these ablation methods in language models and vision transformers.
We find that in different regimes and models, each method can offer the lowest degradation of model performance compared to other methods.
arXiv Detail & Related papers (2024-08-30T14:32:25Z) - Semantic Latent Space Regression of Diffusion Autoencoders for Vertebral Fracture Grading [72.45699658852304]
This paper proposes a novel approach to train a generative Diffusion Autoencoder model as an unsupervised feature extractor.
We model fracture grading as a continuous regression, which is more reflective of the smooth progression of fractures.
Importantly, the generative nature of our method allows us to visualize different grades of a given vertebra, providing interpretability and insight into the features that contribute to automated grading.
arXiv Detail & Related papers (2023-03-21T17:16:01Z) - Hybrid Predictive Coding: Inferring, Fast and Slow [62.997667081978825]
We propose a hybrid predictive coding network that combines both iterative and amortized inference in a principled manner.
We demonstrate that our model is inherently sensitive to its uncertainty and adaptively balances iterative and amortized inference to obtain accurate beliefs at minimal computational expense.
arXiv Detail & Related papers (2022-04-05T12:52:45Z) - Modeling Implicit Bias with Fuzzy Cognitive Maps [0.0]
This paper presents a Fuzzy Cognitive Map model to quantify implicit bias in structured datasets.
We introduce a new reasoning mechanism equipped with a normalization-like transfer function that prevents neurons from saturating.
arXiv Detail & Related papers (2021-12-23T17:04:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.