GIM: Improved Interpretability for Large Language Models
- URL: http://arxiv.org/abs/2505.17630v1
- Date: Fri, 23 May 2025 08:41:45 GMT
- Title: GIM: Improved Interpretability for Large Language Models
- Authors: Joakim Edin, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe,
- Abstract summary: Self-repair is a phenomenon where networks compensate for reduced signal in one component by amplifying others. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation.
- Score: 23.12421433871512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.
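To see the softmax form of self-repair concretely, here is a minimal NumPy sketch (illustrative only, not the paper's code): ablating the dominant attention logit lets softmax redistribute its mass to the remaining keys, so the output barely moves and a naive ablation score underestimates that logit's importance.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One query attending over five keys (illustrative logits and scalar values).
logits = np.array([4.0, 1.0, 0.9, 0.8, 0.7])
values = np.array([1.0, 0.95, 1.05, 0.9, 1.1])

out_full = softmax(logits) @ values

# Ablate the dominant logit by removing it from the softmax.
ablated = logits.copy()
ablated[0] = -np.inf
out_ablated = softmax(ablated) @ values

# Softmax redistributes the removed mass over keys with similar values, so
# the output barely changes and the ablation score understates importance.
print(f"full: {out_full:.4f}  ablated: {out_ablated:.4f}")
```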
Related papers
- Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [53.398270878295754]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs). We suggest categorizing tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful to improve model performance. We conduct experiments on well-established benchmarks, finding that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
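One hypothetical way to read that token split as a training objective (a sketch under assumptions, not the paper's implementation; the `is_positive` mask and the `beta` weight are invented here): standard cross-entropy descent on positive tokens plus a small gradient-ascent "forgetting" term on negative ones.

```python
import torch
import torch.nn.functional as F

def forgetting_sft_loss(logits, targets, is_positive, beta=0.1):
    """Descend on positive tokens, gently ascend (forget) on negative ones.
    logits: (T, V); targets: (T,); is_positive: (T,) bool mask, assumed to
    contain at least one True and one False entry."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    keep = per_token[is_positive].mean()
    forget = per_token[~is_positive].mean()
    return keep - beta * forget
```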
arXiv Detail & Related papers (2025-08-06T11:22:23Z) - Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations [5.5967570276373655]
We systematically analyze the impact of ablating key components in three state-of-the-art detection transformer models. We evaluate the effects of these ablations on the performance metrics gIoU and F1-score. This study advances XAI for DETRs by clarifying the contributions of internal components to model performance.
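The underlying protocol is straightforward to sketch; a hypothetical PyTorch version (the hooked module and metric function are placeholders, not the paper's setup) scores a component by the metric drop when its output is zeroed:

```python
import torch

@torch.no_grad()
def ablation_effect(model, module, inputs, metric_fn):
    """Score a component by how much the metric drops when its output is
    zeroed (larger drop = more important, absent self-repair effects)."""
    baseline = metric_fn(model(inputs))

    # Temporarily replace the module's output with zeros via a forward hook.
    handle = module.register_forward_hook(lambda m, i, out: torch.zeros_like(out))
    ablated = metric_fn(model(inputs))
    handle.remove()

    return baseline - ablated
```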
arXiv Detail & Related papers (2025-07-29T12:00:08Z) - Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones [13.381339115567288]
Language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. Our study reveals that LMs rely on a number of components that independently make their own predictions. We introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance.
arXiv Detail & Related papers (2025-06-30T23:35:19Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
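A minimal sketch of that modification (an illustrative module, not the paper's released code; the gate's parametrization is an assumption): each head's SDPA output is multiplied elementwise by a sigmoid gate computed from the layer input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSDPA(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate after SDPA."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.h, self.dh = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)  # produces per-head gates
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, T, self.h, self.dh).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Sigmoid gate computed from the layer input, one gate per head dim.
        g = torch.sigmoid(self.gate(x)).reshape(B, T, self.h, self.dh).transpose(1, 2)
        return self.out((attn * g).transpose(1, 2).reshape(B, T, -1))
```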
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant. Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations. To fuse degradation awareness and contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition [9.83509397800422]
We propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs. ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores. HSSFGN employs a gating mechanism to achieve multi-scale feature representation.
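The top-k idea itself is easy to sketch (a simplified stand-in: `top_k` is fixed here, whereas the paper's GDTKO learns the gating):

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k):
    """Sparse attention: keep only the top_k scores per query, mask the rest."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    kth = scores.topk(top_k, dim=-1).values[..., -1:]  # k-th largest per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```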
arXiv Detail & Related papers (2025-03-15T05:13:26Z) - Core Context Aware Attention for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-range context modeling. Our CCA-Attention significantly outperforms state-of-the-art models in terms of computational efficiency and long-context modeling ability.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control [44.326363467045496]
Large Language Models (LLMs) have become a critical area of research in Reinforcement Learning from Human Feedback (RLHF).
Representation engineering offers a new, training-free approach.
This technique leverages semantic features to control the representation of an LLM's intermediate hidden states.
It is difficult to encode various semantic contents, like honesty and safety, into a singular semantic feature.
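In practice, representation-engineering control of this kind often amounts to shifting a chosen layer's hidden states along a semantic direction at inference time; a generic, hypothetical sketch (the layer, direction vector, and scale are all placeholders):

```python
import torch

def add_steering_hook(layer, direction, alpha=4.0):
    """Shift a layer's hidden states along a semantic feature direction
    (e.g. an 'honesty' vector) at inference time; no training involved."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)  # keep the handle; .remove() to undo
```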
arXiv Detail & Related papers (2024-11-04T08:36:03Z) - CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance [17.723293304671877]
We propose a Component-based Tool-utilizing ability Injection method (CITI).
According to the gradient-based importance scores of different components, CITI alleviates the capability conflicts caused by the fine-tuning process.
Experimental results demonstrate that our approach achieves outstanding performance across a range of evaluation metrics.
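A common shape for such a score, shown here as a hypothetical sketch (first-order saliency, not necessarily CITI's exact definition): weight each parameter tensor by |theta * dL/dtheta| and sum.

```python
import torch

def component_importance(model, loss):
    """First-order saliency per named parameter tensor: |theta * grad| summed,
    a standard proxy for a component's contribution to the loss."""
    names, params = zip(*model.named_parameters())
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return {n: (p * g).abs().sum().item()
            for n, p, g in zip(names, params, grads) if g is not None}
```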
arXiv Detail & Related papers (2024-09-20T04:06:28Z) - The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights [10.777646083061395]
We introduce "concept editing", an innovative variation of knowledge editing that uncovers conceptualisation mechanisms within large language models.
We analyse the Multi-Layer Perceptron (MLP), Multi-Head Attention (MHA), and hidden state components of transformer models.
Our work highlights the complex, layered nature of semantic processing in LLMs and the challenges of isolating and modifying specific concepts within these models.
arXiv Detail & Related papers (2024-08-05T18:50:08Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A
Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - A Context-Aware Feature Fusion Framework for Punctuation Restoration [28.38472792385083]
We propose a novel Feature Fusion framework based on two-type Attentions (FFA) to alleviate the shortcomings of attention.
Experiments on the popular benchmark dataset IWSLT demonstrate that our approach is effective.
arXiv Detail & Related papers (2022-03-23T15:29:28Z) - Is Attention Better Than Matrix Decomposition? [58.813382406412195]
We show that self-attention is not better than the matrix decomposition model for encoding long-distance dependencies.
We propose a series of Hamburgers, in which we employ optimization algorithms for solving matrix decompositions (MDs) to factorize the input representations into sub-matrices and reconstruct a low-rank embedding.
Comprehensive experiments are conducted in the vision tasks where it is crucial to learn the global context.
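The core mechanic is sketchable in a few lines: flatten the features into a matrix, run a few unrolled decomposition steps, and use the low-rank reconstruction as the global-context output. A toy non-negative matrix factorization version (illustrative only; the paper's Hamburger wraps the decomposition in learned linear layers and supports several MD variants):

```python
import torch

def nmf_context(x, rank=8, iters=6, eps=1e-6):
    """Reconstruct non-negative features X ~= D @ C with a few unrolled
    multiplicative NMF updates; the low-rank result serves as global context."""
    n, d = x.shape
    x = x.clamp_min(eps)
    D = torch.rand(n, rank)
    C = torch.rand(rank, d)
    for _ in range(iters):
        C = C * (D.T @ x) / (D.T @ D @ C + eps)
        D = D * (x @ C.T) / (D @ (C @ C.T) + eps)
    return D @ C
```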
arXiv Detail & Related papers (2021-09-09T20:40:19Z) - Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
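Their attribution is, roughly, integrated gradients applied to the attention matrix; a simplified single-matrix sketch (coarse Riemann sum; `forward_from_attn` is an assumed helper mapping an attention matrix to a scalar loss):

```python
import torch

def attention_attribution(forward_from_attn, attn, steps=20):
    """Attr(A) = A * (1/m) * sum_i dLoss((i/m) * A)/dA, i.e. integrated
    gradients along the path from the zero matrix to A."""
    total = torch.zeros_like(attn)
    for i in range(1, steps + 1):
        scaled = ((i / steps) * attn.detach()).requires_grad_(True)
        total += torch.autograd.grad(forward_from_attn(scaled), scaled)[0]
    return attn.detach() * total / steps  # high score = influential interaction
```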
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.