LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models
- URL: http://arxiv.org/abs/2601.22928v1
- Date: Fri, 30 Jan 2026 12:46:37 GMT
- Title: LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models
- Authors: Alhassan Abdelhalim, Janick Edinger, Sören Laue, Michaela Regneri
- Abstract summary: Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. This paper investigates how linguistic abstraction emerges in LLMs, aiming to detect it across different modules. Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure.
- Score: 3.7965260744113163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Several methods for LLM explainability exist, but many of them are themselves not fully understood. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens, and property-inference methods applied to embeddings failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.
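The abstract's second method, property inference via probing, can be made concrete with a minimal sketch. The code below is not the paper's implementation; the vocabulary size, property labels, and embedding stand-ins are all hypothetical. It trains a linear probe to predict a binary semantic property from embedding vectors and then repeats the probe on random vectors as a control. A control score well above chance would indicate that the probe's accuracy reflects dataset structure or methodological artifacts rather than semantic knowledge encoded in the embeddings, which is the kind of failure the abstract reports.

```python
# Minimal illustrative sketch of a property-inference probe, assuming
# synthetic data throughout (NOT the authors' code or dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, dim = 500, 768                      # hypothetical vocabulary size / embedding width
labels = rng.integers(0, 2, size=n_words)    # hypothetical binary property (e.g. "animate")

# Stand-in for real model embeddings (e.g. rows of an LLM's input embedding matrix),
# with an artificial label-dependent shift so the probe has something to find.
embeddings = rng.normal(size=(n_words, dim)) + 0.5 * labels[:, None]

probe = LogisticRegression(max_iter=1000)
real_score = cross_val_score(probe, embeddings, labels, cv=5).mean()

# Control: identically shaped random vectors carry no semantic signal, so any
# accuracy above chance here comes from the probe or dataset, not the model.
control_vectors = rng.normal(size=(n_words, dim))
control_score = cross_val_score(probe, control_vectors, labels, cv=5).mean()

print(f"probe on embeddings: {real_score:.2f}, on random control: {control_score:.2f}")
```

A common complementary control is to shuffle the labels rather than the inputs; either way, the point of the sketch is that a probe's raw predictive score alone is not evidence of semantic knowledge.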
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity-difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z) - Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation [59.40886078302025]
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood. We present a lightweight black-box framework for explaining autoregressive token generation in MLLMs.
arXiv Detail & Related papers (2025-09-26T15:38:42Z) - Computation Mechanism Behind LLM Position Generalization [59.013857707250814]
Large language models (LLMs) exhibit flexibility in handling textual positions. They can understand texts with position perturbations and generalize to longer texts. This work connects the linguistic phenomenon with LLMs' computational mechanisms.
arXiv Detail & Related papers (2025-03-17T15:47:37Z) - Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders [29.356200147371275]
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. We propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective. We propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
arXiv Detail & Related papers (2025-02-21T16:36:42Z) - PromptExp: Multi-granularity Prompt Explanation of Large Language Models [16.259208045898415]
We introduce PromptExp, a framework for multi-granularity prompt explanations by aggregating token-level insights.
PromptExp supports both white-box and black-box explanations and extends explanations to higher granularity levels.
We evaluate PromptExp in case studies such as sentiment analysis, showing the perturbation-based approach performs best.
arXiv Detail & Related papers (2024-10-16T22:25:15Z) - Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models [41.233879429714925]
This study critically examines the capacity of API-based large language models to comprehend phrase semantics.
We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions.
We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics.
arXiv Detail & Related papers (2024-10-03T08:44:17Z) - Misinforming LLMs: vulnerabilities, challenges and opportunities [4.54019093815234]
Large Language Models (LLMs) have made significant advances in natural language processing, but their underlying mechanisms are often misunderstood.
This paper argues that current LLM architectures are inherently untrustworthy due to their reliance on correlations of sequential patterns of word embedding vectors.
Research into combining generative transformer-based models with fact bases and logic programming languages may lead to the development of trustworthy LLMs.
arXiv Detail & Related papers (2024-08-02T10:35:49Z) - Do LLMs Really Adapt to Domains? An Ontology Learning Perspective [2.0755366440393743]
Large Language Models (LLMs) have demonstrated unprecedented prowess across natural language processing tasks in various application domains.
Recent studies show that LLMs can be leveraged to perform lexical semantic tasks, such as Knowledge Base Completion (KBC) or Ontology Learning (OL).
This paper investigates the question: Do LLMs really adapt to domains and remain consistent in the extraction of structured knowledge, or do they only learn lexical senses instead of reasoning?
arXiv Detail & Related papers (2024-07-29T13:29:43Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties.
The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)