Interpreting Language Models Through Concept Descriptions: A Survey
- URL: http://arxiv.org/abs/2510.01048v1
- Date: Wed, 01 Oct 2025 15:51:44 GMT
- Title: Interpreting Language Models Through Concept Descriptions: A Survey
- Authors: Nils Feldhus, Laura Kopf,
- Abstract summary: We provide the first survey of the emerging field of concept descriptions for model components and abstractions. Our synthesis reveals a growing demand for more rigorous, causal evaluation.
- Score: 3.901807843411349
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.
Related papers
- From Text to Graph: Leveraging Graph Neural Networks for Enhanced Explainability in NLP [3.864700176441583]
This study proposes a novel methodology to achieve explainability in natural language processing tasks. It automatically converts sentences into graphs and maintains semantics through nodes and relations. Experiments delivered promising results in determining the most critical components within the text structure for a given classification.
arXiv Detail & Related papers (2025-04-02T18:55:58Z)
- A Survey of Model Architectures in Information Retrieval [59.61734783818073]
The period from 2019 to the present has represented one of the biggest paradigm shifts in information retrieval (IR) and natural language processing (NLP). We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude with a forward-looking discussion of emerging challenges and future directions.
arXiv Detail & Related papers (2025-02-20T18:42:58Z)
- VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning [86.59849798539312]
We present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.
arXiv Detail & Related papers (2024-10-30T16:11:05Z)
- From Feature Importance to Natural Language Explanations Using LLMs with RAG [4.204990010424084]
We introduce traceable question-answering, leveraging an external knowledge repository to inform responses of Large Language Models (LLMs).
This knowledge repository comprises contextual details regarding the model's output, containing high-level features, feature importance, and alternative probabilities.
We integrate four key characteristics - social, causal, selective, and contrastive - drawn from social science research on human explanations into a single-shot prompt, guiding the response generation process.
arXiv Detail & Related papers (2024-07-30T17:27:20Z)
- Automatic Discovery of Visual Circuits [66.99553804855931]
We explore scalable methods for extracting the subgraph of a vision model's computational graph that underlies recognition of a specific visual concept.
We find that our approach extracts circuits that causally affect model output, and that editing these circuits can defend large pretrained models from adversarial attacks.
arXiv Detail & Related papers (2024-04-22T17:00:57Z)
- Large Language Models for Information Retrieval: A Survey [83.75872593741578]
Information retrieval has evolved from term-based methods to its integration with advanced neural models. Recent research has sought to leverage large language models (LLMs) to improve IR systems. We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions.
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Explainability of Text Processing and Retrieval Methods: A Survey [1.5920521545285267]
This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking.
arXiv Detail & Related papers (2022-12-14T09:25:49Z)
- FACT: Learning Governing Abstractions Behind Integer Sequences [7.895232155155041]
We introduce a novel view on the learning of concepts admitting complete finitary descriptions.
We lay down a set of benchmarking tasks aimed at conceptual understanding by machine learning models.
To further aid research in knowledge representation and reasoning, we present FACT, the Finitary Abstraction Toolkit.
arXiv Detail & Related papers (2022-09-20T08:20:03Z)
- Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning [27.841725567976315]
We propose a novel framework utilizing Adversarial Inverse Reinforcement Learning.
This framework provides global explanations for decisions made by a Reinforcement Learning model.
We capture intuitive tendencies that the model follows by summarizing the model's decision-making process.
arXiv Detail & Related papers (2022-03-30T17:01:59Z)
- Neural Entity Linking: A Survey of Models Based on Deep Learning [82.43751915717225]
This survey presents a comprehensive description of recent neural entity linking (EL) systems developed since 2015.
Its goal is to systemize design features of neural entity linking systems and compare their performance to the remarkable classic methods on common benchmarks.
The survey touches on applications of entity linking, focusing on the recently emerged use-case of enhancing deep pre-trained masked language models.
arXiv Detail & Related papers (2020-05-31T18:02:26Z)