Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs
- URL: http://arxiv.org/abs/2601.20420v2
- Date: Thu, 29 Jan 2026 04:05:29 GMT
- Title: Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs
- Authors: Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi,
- Abstract summary: Mechanistic interpretability seeks to mitigate these issues by extracting human-interpretable concepts from large language models. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts. We show that SAEs suffer from a fundamental theoretical ambiguity: whether a well-defined correspondence between LLM representations and human-interpretable concepts exists remains unclear.
- Score: 51.378834857406325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in high-stakes domains. Mechanistic interpretability seeks to mitigate these issues by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing LLMs' internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: whether a well-defined correspondence between LLM representations and human-interpretable concepts exists remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model in which concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.
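To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of unsupervised linear unmixing with a sparsity prior, in the spirit of sparse ConCA: activations X are modeled as X ≈ S A, where each row of S would play the role of per-concept log-posteriors for one input. The abstract does not specify an optimizer or hyperparameters; scikit-learn's dictionary learning, the toy data, and all names (X, n_concepts, alpha) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of sparse linear unmixing, not the authors' sparse ConCA code.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# X: activations from one LLM layer, one row per token/context.
# A toy sparse mixture is fabricated here so the script runs end to end.
n_samples, d_model, n_concepts = 2000, 64, 16
S_true = rng.exponential(1.0, (n_samples, n_concepts)) * (rng.random((n_samples, n_concepts)) < 0.1)
A_true = rng.normal(0.0, 1.0, (n_concepts, d_model))
X = S_true @ A_true + 0.01 * rng.normal(size=(n_samples, d_model))

# Fit X ≈ S @ A with an L1 penalty on the codes S -- the sparsity prior the
# paper invokes to fight the ill-posedness of the unmixing problem.
dl = DictionaryLearning(
    n_components=n_concepts,   # number of candidate concepts to recover
    alpha=1.0,                 # strength of the sparsity prior (assumed value)
    fit_algorithm="cd",
    transform_algorithm="lasso_cd",
    random_state=0,
)
S_hat = dl.fit_transform(X)    # per-input concept activations (log-posteriors, per the paper's model)
A_hat = dl.components_         # mixing directions in activation space

print(S_hat.shape, A_hat.shape)  # (2000, 16) (16, 64)
print(f"mean nonzeros per input: {(np.abs(S_hat) > 1e-6).sum(1).mean():.1f}")
```

In practice one would replace the fabricated X with activations collected from a real model; the point of the sketch is only that "concept extraction" here amounts to a sparse matrix factorization rather than training an autoencoder.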
Related papers
- UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis [69.50752734049985]
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans. We propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space.
arXiv Detail & Related papers (2026-01-25T16:19:00Z) - Improving Latent Reasoning in LLMs via Soft Concept Mixing [5.230565644173722]
Large language models (LLMs) typically reason by generating discrete tokens. We propose Soft Concept Mixing (SCM), a soft-concept-aware training scheme. SCM exposes the model to soft representations during training.
arXiv Detail & Related papers (2025-11-21T01:43:28Z) - Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations [23.993903128858832]
We develop an evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. We find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
arXiv Detail & Related papers (2025-05-21T20:42:05Z) - A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models [50.34089812436633]
Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. Mechanistic interpretability has attracted significant attention from the research community as a means to understand the inner workings of LLMs. Sparse Autoencoders (SAEs) have emerged as a promising method due to their ability to disentangle the complex, superimposed features within LLMs into more interpretable components (a minimal SAE sketch appears after this list).
arXiv Detail & Related papers (2025-03-07T17:38:00Z) - Do Large Language Models Advocate for Inferentialism? [0.0]
The emergence of large language models (LLMs) such as ChatGPT and Claude presents new challenges for philosophy of language. This paper explores Robert Brandom's inferential semantics as an alternative foundational framework for understanding these systems.
arXiv Detail & Related papers (2024-12-19T03:48:40Z) - Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge [6.948555996661213]
We introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external symbolic knowledge into the parsing process. Our experiments show that LLMs outperform previous encoder-decoder baselines for semantic parsing. RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts.
arXiv Detail & Related papers (2024-12-13T15:30:20Z) - Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning [53.685764040547625]
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. This work provides a fine-grained mathematical analysis to show how transformers leverage the multi-concept semantics of words to enable powerful in-context learning (ICL) and excellent out-of-distribution ICL abilities.
arXiv Detail & Related papers (2024-11-04T15:54:32Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
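For reference against the ConCA sketch above, here is a minimal sparse-autoencoder training loop of the kind the SAE survey entry describes: an overcomplete linear dictionary trained to reconstruct activations under an L1 sparsity penalty. The layer sizes, L1 coefficient, and random stand-in activations are illustrative assumptions, not values from any paper listed here.

```python
# Minimal SAE sketch under assumed sizes and hyperparameters; illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature codes
        self.decoder = nn.Linear(d_dict, d_model)   # dictionary of feature directions

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))             # sparse, non-negative features
        return self.decoder(f), f

d_model, d_dict = 64, 512                           # overcomplete dictionary (assumed sizes)
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                     # sparsity pressure on the codes (assumed)

x = torch.randn(256, d_model)                       # stand-in for collected LLM activations
for _ in range(100):
    x_hat, f = sae(x)
    loss = ((x_hat - x) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The contrast with the ConCA sketch is the learning objective: an SAE learns a nonlinear encoder jointly with the dictionary, whereas ConCA, as framed in the abstract, recovers concept log-posteriors by a purely linear unmixing of the representations.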