Latent Concept Disentanglement in Transformer-based Language Models
- URL: http://arxiv.org/abs/2506.16975v2
- Date: Fri, 26 Sep 2025 13:37:33 GMT
- Title: Latent Concept Disentanglement in Transformer-based Language Models
- Authors: Guan Zhe Hong, Bhavya Vasudeva, Vatsal Sharan, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy
- Abstract summary: Large language models (LLMs) use in-context learning (ICL) to solve a new task. This raises the question of whether and how transformers represent latent structures as part of their computation. We study this question using mechanistic interpretability.
- Score: 15.764142646256785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When large language models (LLMs) use in-context learning (ICL) to solve a new task, they must infer latent concepts from demonstration examples. This raises the question of whether and how transformers represent latent structures as part of their computation. Our work experiments with several controlled tasks, studying this question using mechanistic interpretability. First, we show that in transitive reasoning tasks with a latent, discrete concept, the model successfully identifies the latent concept and does step-by-step concept composition. This builds upon prior work that analyzes single-step reasoning. Then, we consider tasks parameterized by a latent numerical concept. We discover low-dimensional subspaces in the model's representation space, where the geometry cleanly reflects the underlying parameterization. Overall, we show that small and large models can indeed disentangle and utilize latent concepts that they learn in-context from a handful of abbreviated demonstrations.
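A central claim of the abstract is that latent numerical concepts inferred in-context occupy a low-dimensional subspace whose geometry mirrors the underlying parameterization. The snippet below is a minimal, hypothetical sketch of that probing idea, not the paper's actual setup: it builds toy ICL prompts whose demonstrations all share a latent slope w, reads out hidden states from a small off-the-shelf causal LM, and uses PCA to check whether a leading direction tracks w. The prompt format, model choice (gpt2), layer index, and last-token read-out are illustrative assumptions.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: a small causal LM and a toy ICL task y = w * x,
# where the slope w is the latent numerical concept shared by all
# demonstrations inside a prompt.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def make_prompt(w: int, n_demos: int = 4, seed: int = 0) -> str:
    """Build an ICL prompt whose demonstrations share the latent slope w."""
    rng = np.random.default_rng(seed + w)
    xs = rng.integers(1, 10, size=n_demos)
    demos = " ".join(f"{x} -> {w * x};" for x in xs)
    return demos + " 7 ->"  # query left for the model to complete

latent_values = list(range(1, 9))  # values of the latent concept w
reps = []
with torch.no_grad():
    for w in latent_values:
        inputs = tokenizer(make_prompt(w), return_tensors="pt")
        out = model(**inputs)
        # Last-token hidden state at a middle layer as the concept read-out
        # (layer index 6 is an arbitrary, illustrative choice).
        reps.append(out.hidden_states[6][0, -1].numpy())
reps = np.stack(reps)  # shape: (num_prompts, hidden_dim)

# PCA via SVD: if the latent parameter is linearly encoded, a few
# components should capture most of the variance and correlate with w.
centered = reps - reps.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / (s**2).sum()
pc1 = centered @ vt[0]
print("variance explained by top-2 components:", float(explained[:2].sum()))
print("correlation of PC1 with latent slope:",
      float(np.corrcoef(pc1, latent_values)[0, 1]))
```

On a toy linear task like this, a strong correlation between the top principal component and the latent slope would mirror, in miniature, the kind of clean subspace geometry the paper reports; the sketch only makes the probing recipe concrete and is not the authors' experimental protocol.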
Related papers
- FaCT: Faithful Concept Traces for Explaining Neural Network Decisions [56.796533084868884]
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge. We put emphasis on the faithfulness of concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations. Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced.
arXiv Detail & Related papers (2025-10-29T13:35:46Z) - Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z) - Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning [9.795934690403374]
It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. We employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process. We demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
arXiv Detail & Related papers (2025-02-13T07:19:05Z) - Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher-level idea or action in a flow. We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z) - Sparse autoencoders reveal selective remapping of visual concepts during adaptation [54.82630842681845]
Adapting foundation models for specific purposes has become a standard approach to building machine learning systems. We develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts.
arXiv Detail & Related papers (2024-12-06T18:59:51Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergent capabilities.
This work provides a fine-grained mathematical analysis showing how transformers leverage the multi-concept semantics of words to enable powerful ICL and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution [23.594013836364628]
We propose an approach to approximate the subspace representing a specific concept. We demonstrate the effectiveness of the Gaussian Concept Subspace (GCS) through measuring its faithfulness and plausibility across multiple large language models. We also use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering.
arXiv Detail & Related papers (2024-09-30T18:52:53Z) - How to Blend Concepts in Diffusion Models [48.68800153838679]
Recent methods exploit multiple latent representations and their connection, making this research question even more entangled.
Our goal is to understand how operations in the latent space affect the underlying concepts.
Our conclusion is that concept blending through space manipulation is possible, although the best strategy depends on the context of the blend.
arXiv Detail & Related papers (2024-07-19T13:05:57Z) - PaCE: Parsimonious Concept Engineering for Large Language Models [57.740055563035256]
Parsimonious Concept Engineering (PaCE) is a novel activation engineering framework for alignment.
We construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept.
We show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
arXiv Detail & Related papers (2024-06-06T17:59:10Z) - On the Tip of the Tongue: Analyzing Conceptual Representation in Large Language Models with Reverse-Dictionary Probe [36.65834065044746]
We use in-context learning to guide the models to generate the term for an object concept implied in a linguistic description.
Experiments suggest that the conceptual inference ability probed by the reverse-dictionary task predicts a model's general reasoning performance.
arXiv Detail & Related papers (2024-02-22T09:45:26Z) - Identifying Linear Relational Concepts in Large Language Models [16.917379272022064]
Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations.
We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts.
arXiv Detail & Related papers (2023-11-15T14:01:41Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z) - Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation [66.86987509942607]
We evaluate how such a paradigm should be done in imitation learning.
We consider a setting where the pretraining corpus consists of multitask demonstrations.
We argue that inverse dynamics modeling is well-suited to this setting.
arXiv Detail & Related papers (2023-05-26T14:40:46Z) - ConceptX: A Framework for Latent Concept Analysis [21.760620298330235]
We present ConceptX, a human-in-the-loop framework for interpreting and annotating the latent representational space of pre-trained Language Models (pLMs).
We use an unsupervised method to discover concepts learned in these models and enable a graphical interface for humans to generate explanations for the concepts.
arXiv Detail & Related papers (2022-11-12T11:31:09Z) - On the Transformation of Latent Space in Fine-Tuned NLP Models [21.364053591693175]
We study the evolution of latent space in fine-tuned NLP models.
We discover latent concepts in the representational space using hierarchical clustering.
We compare pre-trained and fine-tuned models across three models and three downstream tasks.
arXiv Detail & Related papers (2022-10-23T10:59:19Z) - RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning.
We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs.
We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z) - Human-Centered Concept Explanations for Neural Networks [47.71169918421306]
We introduce concept explanations, including the class of Concept Activation Vectors (CAV).
We then discuss approaches to automatically extract concepts, and approaches to address some of their caveats.
Finally, we discuss some case studies that showcase the utility of such concept-based explanations in synthetic settings and real world applications.
arXiv Detail & Related papers (2022-02-25T01:27:31Z) - Interpretable Visual Reasoning via Induced Symbolic Space [75.95241948390472]
We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images.
We first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features.
We then come up with a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words.
arXiv Detail & Related papers (2020-11-23T18:21:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.