Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
- URL: http://arxiv.org/abs/2601.21996v1
- Date: Thu, 29 Jan 2026 17:06:54 GMT
- Title: Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
- Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
- Abstract summary: We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. We causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads.
- Score: 34.05875226612676
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
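At its core, MDA scores each training example by how much its removal would change a chosen unit metric, which is exactly what an influence function estimates. Below is a minimal, self-contained sketch of that computation on a toy model: the "unit metric" stands in for an interpretability score on a specific head, and the explicit Hessian inverse would be replaced by LiSSA- or EK-FAC-style approximations at LLM scale. The model, data, and names are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal influence-function sketch (Koh & Liang style), as a stand-in for
# MDA's attribution step. Everything here is a toy assumption.
import torch

torch.manual_seed(0)
n, d = 64, 5
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()
w = torch.zeros(d, requires_grad=True)

def loss_fn(w, X, y):
    return torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)

# Fit by a few gradient steps (stand-in for pretraining).
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss_fn(w, X, y).backward()
    opt.step()

# "Unit metric": here a loss on one held-out point; in MDA this would be an
# interpretability score for a specific attention head (assumption).
x_test, y_test = torch.randn(d), torch.tensor(1.0)
g_test = torch.autograd.grad(loss_fn(w, x_test[None], y_test[None]), w)[0]

# Damped Hessian of the training loss; explicit inverse only works at toy scale.
H = torch.autograd.functional.hessian(lambda p: loss_fn(p, X, y), w.detach())
H_inv = torch.linalg.inv(H + 0.01 * torch.eye(d))

# Influence of training point i: -grad_test^T H^{-1} grad_i
scores = []
for i in range(n):
    g_i = torch.autograd.grad(loss_fn(w, X[i:i+1], y[i:i+1]), w)[0]
    scores.append(-(g_test @ H_inv @ g_i).item())

top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:5]
print("most influential training indices:", top)
```

The intervention experiments in the abstract then follow directly: drop (or upsample) the top-scored examples, retrain, and measure whether the interpretable head emerges earlier or later.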
Related papers
- Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision [11.159231524113764]
Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). In this paper, we propose the Guided Verifier framework to address these structural limitations. We develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier.
arXiv Detail & Related papers (2026-02-04T07:38:42Z) - How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z) - Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis [3.1526281887627587]
Distinguishing recall from reasoning is crucial for predicting model generalization. We use controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models.
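For a concrete sense of what head-level probing involves, here is a minimal sketch that reads out one attention map per head so individual heads can be inspected or ablated. The model and the "previous-token" diagnostic are toy stand-ins (assumptions), not the paper's probed transformers.

```python
# Head-level probing sketch: keep per-head attention maps instead of the
# averaged map, then compute a simple per-head diagnostic.
import torch

torch.manual_seed(0)
d_model, n_heads, seq = 16, 4, 8
mha = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
x = torch.randn(1, seq, d_model)

# average_attn_weights=False keeps one attention map per head.
out, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(attn.shape)  # (batch, n_heads, seq, seq)

# Per-head attention to the previous token: one signature often checked
# when hunting for induction-style circuits (assumption, illustrative only).
prev_token_score = attn[0, :, 1:, :-1].diagonal(dim1=-2, dim2=-1).mean(dim=-1)
print("previous-token attention per head:", prev_token_score)
```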
arXiv Detail & Related papers (2025-10-03T04:13:06Z) - Migration as a Probe: A Generalizable Benchmark Framework for Specialist vs. Generalist Machine-Learned Force Fields [1.572216094651749]
Machine-learned force fields (MLFFs) are transforming computational materials science by enabling ab initio-level accuracy at molecular dynamics scales. Yet their rapid rise raises a key question: should researchers train specialist models from scratch, fine-tune generalist foundation models, or use hybrid approaches? We introduce a benchmarking framework using defect migration pathways, evaluated through elastic band trajectories, as diagnostic probes. Fine-tuned models substantially outperform from-scratch and zero-shot approaches for kinetic properties but show partial loss of long-range physics.
arXiv Detail & Related papers (2025-08-27T13:24:41Z) - Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in the ubiquitous deployment of large language models (LLMs). We provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
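A common way to test for such an internal signal is a layer-wise linear probe on hidden states. The sketch below uses synthetic activations in place of real LLM hidden states (an assumption), with the correctness signal injected most strongly at intermediate layers, mirroring the reported peak.

```python
# Layer-wise linear probing sketch for "does the model know it is right?".
# Hidden states are synthetic stand-ins, not real LLM activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d, n_layers = 400, 32, 6
correct = rng.integers(0, 2, size=n)            # 1 = fact recalled correctly

for layer in range(n_layers):
    # Assumption: correctness signal is strongest at intermediate layers.
    signal = 1.5 * np.exp(-((layer - n_layers / 2) ** 2))
    h = rng.normal(size=(n, d))
    h[:, 0] += signal * (2 * correct - 1)       # inject the signal
    acc = cross_val_score(LogisticRegression(max_iter=1000), h, correct, cv=5).mean()
    print(f"layer {layer}: probe accuracy {acc:.2f}")
```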
arXiv Detail & Related papers (2025-05-27T16:24:02Z) - Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning [50.53703102032562]
Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood.
arXiv Detail & Related papers (2025-05-16T08:50:42Z) - Latent Causal Probing: A Formal Perspective on Probing with Causal Models of Data [3.376269351435396]
We develop a formal perspective on probing using structural causal models (SCMs).
We extend a recent study of LMs in the context of a synthetic grid-world navigation task.
Our techniques provide robust empirical evidence for the ability of LMs to induce the latent concepts underlying text.
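One plausible way to write that formal perspective down (a reconstruction of the setup, not necessarily the paper's exact notation): a latent concept Z generates text X, the LM maps X to a representation H, and a probe g tries to recover Z from H.

```latex
% SCM sketch of the probing setup (notation is an assumption).
\begin{align*}
  Z       &:= f_Z(U_Z)       && \text{latent concept (e.g., grid-world state)} \\
  X       &:= f_X(Z, U_X)    && \text{observed text}                           \\
  H       &:= \mathrm{LM}(X) && \text{model representation}                    \\
  \hat{Z} &:= g(H)           && \text{probe readout}
\end{align*}
```

On this view, "the LM induces the latent concept" means that information about Z is decodable from H over and above what surface features of X alone provide.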
arXiv Detail & Related papers (2024-07-18T17:59:27Z) - How Do Nonlinear Transformers Learn and Generalize in In-Context Learning? [82.51626700527837]
Transformer-based large language models displayed impressive in-context learning capabilities, where a pre-trained model can handle new tasks without fine-tuning.
We analyze the mechanics of how Transformers achieve ICL, addressing the technical challenges of the underlying training problem.
arXiv Detail & Related papers (2024-02-23T21:07:20Z) - SLEM: Machine Learning for Path Modeling and Causal Inference with Super Learner Equation Modeling [3.988614978933934]
Causal inference is a crucial goal of science, enabling researchers to arrive at meaningful conclusions using observational data.
Path models, Structural Equation Models (SEMs) and Directed Acyclic Graphs (DAGs) provide a means to unambiguously specify assumptions regarding the causal structure underlying a phenomenon.
We propose Super Learner Equation Modeling, a path-modeling technique that integrates machine-learning Super Learner ensembles.
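A Super Learner is, at heart, a cross-validated stacking ensemble, which SLEM integrates into the path model's equations. A minimal sketch with scikit-learn, where the candidate learners and data are illustrative assumptions:

```python
# Super Learner sketch: a stacking ensemble whose meta-learner combines
# out-of-fold (cv=5) predictions of a library of base learners.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

super_learner = StackingRegressor(
    estimators=[
        ("ols", LinearRegression()),
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,  # cross-validated stacking is the Super Learner recipe
)
print("Super Learner R^2:", cross_val_score(super_learner, X, y, cv=5).mean())
```

In SLEM, one such ensemble would stand in for each structural equation (each node given its parents in the DAG), keeping the causal assumptions explicit while letting the functional forms be learned.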
arXiv Detail & Related papers (2023-08-08T16:04:42Z) - Discovering Latent Causal Variables via Mechanism Sparsity: A New Principle for Nonlinear ICA [81.4991350761909]
Independent component analysis (ICA) refers to an ensemble of methods that formalize the goal of recovering latent variables and provide estimation procedures for practical application.
We show that the latent variables can be recovered up to a permutation if one regularizes the latent mechanisms to be sparse.
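The identifiability result concerns nonlinear ICA, but the regularizer itself is easy to illustrate. The toy below fits a linear latent transition with an L1 penalty so the learned mechanism graph comes out sparse; it is a deliberately simplified analogue (all parameters are assumptions), not the paper's estimator.

```python
# Toy linear analogue of mechanism-sparsity regularization: L1-penalized
# fit of a latent transition matrix recovers a sparse mechanism graph.
import torch

torch.manual_seed(0)
k, T = 4, 500
A_true = torch.tensor([[0.9, 0.0, 0.0, 0.0],
                       [0.5, 0.8, 0.0, 0.0],
                       [0.0, 0.0, 0.7, 0.0],
                       [0.0, 0.0, 0.4, 0.6]])   # sparse true mechanism

z = torch.zeros(T, k)
for t in range(1, T):
    z[t] = z[t - 1] @ A_true.T + 0.1 * torch.randn(k)

A_hat = torch.zeros(k, k, requires_grad=True)
opt = torch.optim.Adam([A_hat], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    pred = z[:-1] @ A_hat.T
    # Prediction loss plus the sparsity penalty on the mechanism weights.
    loss = ((pred - z[1:]) ** 2).mean() + 0.002 * A_hat.abs().sum()
    loss.backward()
    opt.step()

print((A_hat.detach().abs() > 0.1).int())  # recovered sparsity pattern
```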
arXiv Detail & Related papers (2021-07-21T14:22:14Z) - Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
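The underlying idea, in its simplest form: train against an interventional rather than an observational likelihood, so the model cannot exploit confounder-induced shortcuts. The toy below shows the gap the method targets, using discrete variables and backdoor adjustment; this is an illustrative simplification, not the paper's implicit/explicit CMLE objectives.

```python
# Backdoor-adjustment toy: the observational p(y|x) is inflated by a
# confounder z, while p(y|do(x)) = sum_z p(y|x,z) p(z) is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)                                    # confounder
x = (rng.random(n) < 0.2 + 0.6 * z).astype(int)              # z -> x
y = (rng.random(n) < 0.1 + 0.3 * x + 0.5 * z).astype(int)    # x, z -> y

# Observational estimate p(y=1 | x=1): picks up the spurious x-z link.
obs = y[x == 1].mean()

# Interventional estimate p(y=1 | do(x=1)) via backdoor adjustment.
adj = sum(y[(x == 1) & (z == v)].mean() * (z == v).mean() for v in (0, 1))

print(f"observational p(y|x=1) = {obs:.3f}, interventional = {adj:.3f}")
# A CMLE-style trainer would maximize likelihood under the adjusted
# distribution, removing the incentive to learn the spurious correlation.
```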
arXiv Detail & Related papers (2021-06-07T17:47:16Z)