Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
- URL: http://arxiv.org/abs/2505.06032v1
- Date: Fri, 09 May 2025 13:26:21 GMT
- Title: Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification
- Authors: Leon Eshuijs, Shihan Wang, Antske Fokkens
- Abstract summary: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome.
- Score: 2.262217900462841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
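The mitigation step described in the abstract, selectively deactivating attention heads, can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation and not HTA itself: it assumes a single causal multi-head self-attention layer written in NumPy, and the function name `multi_head_attention`, the `head_mask` parameter, and all tensor shapes are hypothetical choices made for illustration. Zeroing a head's entry in `head_mask` removes that head's contribution to the layer output.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked positions (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=None):
    """Toy causal multi-head self-attention with optional head ablation.

    Hypothetical shapes (assumptions for this sketch):
      x:              (seq, d_model)
      w_q, w_k, w_v:  (n_heads, d_model, d_head)
      w_o:            (n_heads, d_head, d_model)
      head_mask:      (n_heads,), 1.0 keeps a head, 0.0 deactivates it
    Returns the layer output (seq, d_model) and the attention
    patterns (n_heads, seq, seq).
    """
    n_heads, _, d_head = w_q.shape
    seq = x.shape[0]
    if head_mask is None:
        head_mask = np.ones(n_heads)
    head_mask = np.asarray(head_mask, dtype=float)

    # Per-head queries, keys, and values: (n_heads, seq, d_head).
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)

    # Scaled dot-product scores with a causal mask (no attention to the future).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)

    # Each head's separate contribution to the output: (n_heads, seq, d_model).
    per_head = np.einsum("hst,hte,hed->hsd", attn, v, w_o)

    # Zero-ablation: a masked head contributes nothing to the sum.
    out = (head_mask[:, None, None] * per_head).sum(axis=0)
    return out, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d_model, n_heads, d_head = 6, 8, 2, 4
    x = rng.normal(size=(seq, d_model))
    w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
    w_o = rng.normal(size=(n_heads, d_head, d_model))

    full, _ = multi_head_attention(x, w_q, w_k, w_v, w_o)
    ablated, _ = multi_head_attention(x, w_q, w_k, w_v, w_o, head_mask=[1.0, 0.0])
    print("change in output norm after ablating head 1:",
          np.linalg.norm(full - ablated))
```

Because the layer output is a sum of per-head contributions, ablating a head this way removes exactly its additive term. In a real model one would pick which heads to mask from an analysis of where they attend, which this sketch does not attempt.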
Related papers
- Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs [52.663816303997194]
A key factor influencing answer quality is the length of the thinking stage. This paper explores and exploits the mechanisms by which LLMs understand and regulate the length of their reasoning. Our results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.
arXiv Detail & Related papers (2025-06-08T17:54:33Z) - Navigating the Shortcut Maze: A Comprehensive Analysis of Shortcut Learning in Text Classification by Language Models [20.70050968223901]
This study addresses the overlooked impact of subtler, more complex shortcuts that compromise model reliability beyond oversimplified shortcuts.
We introduce a comprehensive benchmark that categorizes shortcuts into occurrence, style, and concept.
Our research systematically investigates models' resilience and susceptibilities to sophisticated shortcuts.
arXiv Detail & Related papers (2024-09-26T01:17:42Z) - Towards Faithful Explanations: Boosting Rationalization with Shortcuts Discovery [12.608345627859322]
We propose a shortcuts-fused Selective Rationalization (SSR) method, which boosts the rationalization by discovering and exploiting potential shortcuts.
Specifically, SSR first designs a shortcuts discovery approach to detect several potential shortcuts.
Then, by introducing the identified shortcuts, we propose two strategies to mitigate the problem of utilizing shortcuts to compose rationales.
arXiv Detail & Related papers (2024-03-12T07:24:17Z) - Investigating Multi-Hop Factual Shortcuts in Knowledge Editing of Large Language Models [18.005770232698566]
We first explore the existence of factual shortcuts through Knowledge Neurons.
We analyze the risks posed by factual shortcuts from the perspective of multi-hop knowledge editing.
arXiv Detail & Related papers (2024-02-19T07:34:10Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, which lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Discovering Highly Influential Shortcut Reasoning: An Automated Template-Free Approach [10.609035331083218]
We propose a novel method for identifying shortcut reasoning.
The proposed method quantifies the severity of the shortcut reasoning by leveraging out-of-distribution data.
Our experiments on Natural Language Inference and Sentiment Analysis demonstrate that our framework successfully discovers known and unknown shortcut reasoning.
arXiv Detail & Related papers (2023-12-15T11:45:42Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z) - Which Shortcut Solution Do Question Answering Models Prefer to Learn? [38.36299280464046]
Question answering (QA) models for reading comprehension tend to learn shortcut solutions rather than the solutions intended by QA datasets.
We show that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA.
We experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set.
arXiv Detail & Related papers (2022-11-29T13:57:59Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Shortcut Learning of Large Language Models in Natural Language Understanding [119.45683008451698]
Large language models (LLMs) have achieved state-of-the-art performance on a series of natural language understanding tasks.
They might rely on dataset bias and artifacts as shortcuts for prediction.
This has significantly affected their generalizability and adversarial robustness.
arXiv Detail & Related papers (2022-08-25T03:51:39Z) - Why Machine Reading Comprehension Models Learn Shortcuts? [56.629192589376046]
We argue that a larger proportion of shortcut questions in the training data makes models rely excessively on shortcut tricks.
A thorough empirical analysis shows that MRC models tend to learn shortcut questions earlier than challenging questions.
arXiv Detail & Related papers (2021-06-02T08:43:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.