Training Language Models to Explain Their Own Computations
- URL: http://arxiv.org/abs/2511.08579v1
- Date: Wed, 12 Nov 2025 02:05:44 GMT
- Title: Training Language Models to Explain Their Own Computations
- Authors: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
- Abstract summary: We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs.
- Score: 73.8562596518326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a *different* model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
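The training setup described in the abstract can be pictured as ordinary supervised fine-tuning: (query, explanation) pairs, where the target explanations come from an existing interpretability pipeline, are used to fine-tune a causal LM with the usual language-modeling loss. The sketch below is only an illustration of that recipe, not the authors' released code; the model name, prompt format, and toy example pairs are assumptions.

```python
# Minimal sketch (not the paper's released code): supervised fine-tuning of an
# "explainer" LM on (query, explanation) pairs, where the target explanations
# come from an existing interpretability pipeline. Model name, prompt format,
# and the toy dataset below are illustrative assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM; the paper fine-tunes the subject model itself
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy ground-truth explanations, e.g. feature descriptions produced by an
# existing auto-interpretability method (hypothetical examples).
pairs = [
    ("Describe what feature 1234 in layer 6 responds to.",
     "This feature activates on month names appearing inside dates."),
    ("Describe how ablating attention head 3.7 changes the output for 'The capital of France is'.",
     "Removing this head reduces the probability of ' Paris' relative to other city names."),
]

def encode(query, explanation):
    # Plain causal-LM objective over "query -> explanation"; for simplicity the
    # loss here covers the whole sequence rather than only the answer tokens.
    text = f"Q: {query}\nA: {explanation}{tokenizer.eos_token}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    enc["labels"] = enc["input_ids"].clone()
    return {k: v.squeeze(0) for k, v in enc.items()}

dataset = [encode(q, e) for q, e in pairs]

def collate(batch):
    # Pad to the longest example in the batch; padded label positions are ignored (-100).
    pad_id = tokenizer.pad_token_id
    max_len = max(len(x["input_ids"]) for x in batch)
    def pad(t, value):
        return torch.cat([t, torch.full((max_len - len(t),), value, dtype=t.dtype)])
    return {
        "input_ids": torch.stack([pad(x["input_ids"], pad_id) for x in batch]),
        "attention_mask": torch.stack([pad(x["attention_mask"], 0) for x in batch]),
        "labels": torch.stack([pad(x["labels"], -100) for x in batch]),
    }

loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"loss: {loss.item():.3f}")
```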
Related papers
- LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models [3.7965260744113163]
Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. This paper investigates how linguistic abstraction emerges in LLMs, aiming to detect it across different modules. Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure.
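One simple way to probe the assumption mentioned above (an illustration, not necessarily the paper's procedure) is to check, layer by layer, whether the hidden state at each position is still nearest, among the input-embedding rows, to the token that originally occupied that position.

```python
# Illustrative probe (an assumption, not the paper's actual method): for each
# layer, check whether the hidden state at each position is still nearest, by
# cosine similarity over the input-embedding matrix, to the original token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # assumption: any decoder-only LM would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

text = "Interpretability methods often assume hidden states stay token-aligned."
enc = tok(text, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**enc).hidden_states  # tuple: embedding layer + one entry per block

emb = model.get_input_embeddings().weight                # (vocab, d)
emb_norm = torch.nn.functional.normalize(emb, dim=-1)
input_ids = enc["input_ids"][0]

for layer, h in enumerate(hidden_states):
    h_norm = torch.nn.functional.normalize(h[0], dim=-1)  # (seq, d)
    nearest = (h_norm @ emb_norm.T).argmax(dim=-1)         # nearest vocab item per position
    match_rate = (nearest == input_ids).float().mean().item()
    print(f"layer {layer:2d}: fraction of positions still nearest to their own token = {match_rate:.2f}")
```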
arXiv Detail & Related papers (2026-01-30T12:46:37Z)
- LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations [35.01080969148123]
Language models (LMs) increasingly drive real-world applications that require world knowledge. We present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends.
arXiv Detail & Related papers (2025-09-03T15:31:18Z)
- Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study [11.117380681219295]
We present an automated framework to generate high-quality textual explanations. We rigorously assess the quality of these explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations.
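As a toy illustration of that kind of automatic assessment, the sketch below scores a generated explanation against a reference with a unigram-overlap F1; this is a simplified stand-in for a fuller NLG-metric suite (BLEU, ROUGE, and similar), and the example strings are hypothetical.

```python
# Sketch of automatic explanation scoring in the spirit described above:
# a simple token-overlap F1 standing in for richer NLG metrics.
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated explanation."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference (human) and candidate (LLM-generated) explanations.
human_explanation = "The review is positive because it praises the acting and the pacing."
llm_explanation = "The text praises the acting and pacing, so the sentiment is positive."
print(f"token F1: {token_f1(human_explanation, llm_explanation):.2f}")
```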
arXiv Detail & Related papers (2025-08-13T12:59:08Z)
- Can Language Models Explain Their Own Classification Behavior? [1.8177391253202122]
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge.
This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes.
We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
arXiv Detail & Related papers (2024-05-13T02:31:08Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate the limitations of earlier analysis methods by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
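A rough picture of what a second-order explanation of a bilinear similarity looks like is sketched below: for s(x, y) = phi(x)^T phi(y), pairwise relevances R[i, j] are formed from Gradient*Input contributions of each branch. This is a simplified stand-in for BiLRP (which propagates relevance layer-wise rather than using raw gradients), with a toy encoder in place of a real NLP model.

```python
# Rough sketch of second-order attribution for a bilinear similarity score
# s(x, y) = phi(x)^T phi(y). Uses a Gradient*Input approximation rather than
# layer-wise relevance propagation, and a toy encoder, purely to illustrate
# "which feature interactions drive similarity".
import torch

torch.manual_seed(0)
d_in, d_feat = 8, 4
phi = torch.nn.Sequential(torch.nn.Linear(d_in, 16), torch.nn.Tanh(), torch.nn.Linear(16, d_feat))

x = torch.randn(d_in)
y = torch.randn(d_in)

# Jacobians of the feature map at x and y: shape (d_feat, d_in).
Jx = torch.autograd.functional.jacobian(phi, x)
Jy = torch.autograd.functional.jacobian(phi, y)

# Gradient*Input relevance of each feature k onto each input dimension,
# then contract over k to get pairwise interactions R[i, j].
Ax = Jx * x            # (d_feat, d_in): contribution of x_i to phi_k(x)
Ay = Jy * y            # (d_feat, d_in): contribution of y_j to phi_k(y)
R = Ax.T @ Ay          # (d_in, d_in): interaction of x_i with y_j

similarity = phi(x) @ phi(y)
print("similarity:", similarity.item())
print("sum of pairwise relevances:", R.sum().item())  # matches the similarity only for (near-)linear phi
```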
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
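The boosting recipe can be sketched as a standard AdaBoost loop in which each weak learner would be a prompt-based LLM classifier fit from a weighted sample of examples. In the runnable sketch below, `fit_llm_weak_learner` is a hypothetical stand-in (a simple threshold rule) for such an LLM call, and the data are synthetic.

```python
# Minimal AdaBoost-style sketch in the spirit of the idea above: each weak
# learner stands in for a prompt-based LLM classifier built from a weighted
# sample of training examples. `fit_llm_weak_learner` is a hypothetical
# placeholder implemented as a threshold rule so the code runs end to end.
import math
import random

random.seed(0)
X = [[random.uniform(-1, 1)] for _ in range(20)]
y = [1 if x[0] > 0.1 else -1 for x in X]            # toy labels in {-1, +1}

def fit_llm_weak_learner(X, y, weights):
    """Hypothetical: in the real setting this would sample a weighted few-shot
    prompt and return an LLM-backed classifier; here it picks the best
    weighted threshold to keep the sketch self-contained."""
    best = None
    for t in [x[0] for x in X]:
        for sign in (1, -1):
            err = sum(w for x, label, w in zip(X, y, weights)
                      if (sign if x[0] > t else -sign) != label)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: (sign if x[0] > t else -sign)

weights = [1.0 / len(X)] * len(X)
ensemble = []
for _ in range(5):                                    # boosting rounds
    h = fit_llm_weak_learner(X, y, weights)
    err = sum(w for x, label, w in zip(X, y, weights) if h(x) != label)
    err = min(max(err, 1e-9), 1 - 1e-9)
    alpha = 0.5 * math.log((1 - err) / err)
    ensemble.append((alpha, h))
    # Reweight: upweight examples the current weak learner got wrong.
    weights = [w * math.exp(-alpha * label * h(x)) for x, label, w in zip(X, y, weights)]
    z = sum(weights)
    weights = [w / z for w in weights]

predict = lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
accuracy = sum(predict(x) == label for x, label in zip(X, y)) / len(X)
print(f"ensemble training accuracy: {accuracy:.2f}")
```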
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Understanding Emergent In-Context Learning from a Kernel Regression Perspective [55.95455089638838]
Large language models (LLMs) have initiated a paradigm shift in transfer learning. This paper proposes a kernel-regression perspective for understanding LLMs' ICL behaviors when faced with in-context examples. We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
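The kernel-regression reading of ICL can be illustrated with a toy Nadaraya-Watson estimator: the prediction for a query is a softmax-kernel-weighted average of the in-context demonstration labels, which mirrors how an attention head mixes value vectors. The sketch below is illustrative only and is not the paper's analysis.

```python
# Toy Nadaraya-Watson kernel regression over in-context demonstrations
# (an illustration of the kernel-regression view, not the paper's derivation):
# the query's prediction is a softmax-kernel-weighted average of demo labels,
# much like an attention readout over value vectors.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0])
demos_x = rng.normal(size=(8, 4))                     # in-context example representations
demos_y = demos_x @ true_w + 0.1 * rng.normal(size=8)  # noisy labels for the demos
query_x = rng.normal(size=4)

def kernel_regression(query, xs, ys, temperature=1.0):
    # Softmax kernel over dot-product similarities, as in attention.
    scores = xs @ query / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ ys

print("kernel-regression prediction:", kernel_regression(query_x, demos_x, demos_y))
print("true linear target:          ", query_x @ true_w)
```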
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
- LMExplainer: Grounding Knowledge and Explaining Language Models [37.578973458651944]
Language models (LMs) like GPT-4 are important in AI applications, but their opaque decision-making process reduces user trust, especially in safety-critical areas.
We introduce LMExplainer, a novel knowledge-grounded explainer that clarifies the reasoning process of LMs through intuitive, human-understandable explanations.
arXiv Detail & Related papers (2023-03-29T08:59:44Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
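A simplified version of that computation is sketched below: examples are split by whether the explanation alone already reveals the model's output (leakage), the gain from showing the explanation is measured within each subset, and the two gains are macro-averaged. The per-example simulator predictions are hypothetical placeholders, and the paper's exact formulation may differ.

```python
# Simplified sketch of a leakage-adjusted simulatability score, following the
# description above: measure how much explanations help a simulator predict the
# model's output, macro-averaged over "leaking" and "non-leaking" subsets
# (leaking = the explanation alone already reveals the output). The per-example
# predictions are hypothetical placeholders for a real simulator model.
from statistics import mean

examples = [
    # y: model output; sim_xe: simulator prediction from (input, explanation);
    # sim_x: from input alone; sim_e: from explanation alone
    {"y": 1, "sim_xe": 1, "sim_x": 0, "sim_e": 1},
    {"y": 0, "sim_xe": 0, "sim_x": 0, "sim_e": 1},
    {"y": 1, "sim_xe": 1, "sim_x": 1, "sim_e": 0},
    {"y": 0, "sim_xe": 1, "sim_x": 1, "sim_e": 0},
]

def leakage_adjusted_simulatability(examples):
    leaking = [ex for ex in examples if ex["sim_e"] == ex["y"]]
    nonleaking = [ex for ex in examples if ex["sim_e"] != ex["y"]]

    def gain(group):
        # How much the explanation helps beyond the input alone, within this group.
        if not group:
            return 0.0
        acc_xe = mean(int(ex["sim_xe"] == ex["y"]) for ex in group)
        acc_x = mean(int(ex["sim_x"] == ex["y"]) for ex in group)
        return acc_xe - acc_x

    # Macro-average so trivially label-leaking explanations cannot dominate the score.
    return 0.5 * (gain(leaking) + gain(nonleaking))

print("LAS (sketch):", leakage_adjusted_simulatability(examples))
```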
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
- Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge [96.92252296244233]
Large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control.
We show that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements.
Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.
arXiv Detail & Related papers (2020-06-11T17:02:20Z)