Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
- URL: http://arxiv.org/abs/2506.07851v2
- Date: Fri, 24 Oct 2025 03:13:05 GMT
- Title: Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
- Authors: Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin,
- Abstract summary: Large language models (LLMs) have demonstrated significant improvements in contextual understanding.<n>However, their ability to attend to truly critical information during long-context reasoning and generation still falls behind the pace.<n>We introduce a two-stage framework called Learning to Focus (LeaF) to mitigate confounding factors.
- Score: 62.23671919314693
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still falls behind the pace. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model's attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model's capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student's attention with the teacher's focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning, code generation and multi-hop question answering benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
Related papers
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning [62.680551162054975]
We introduce an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization.<n>We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows.<n>Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead.
arXiv Detail & Related papers (2026-02-03T08:34:20Z) - Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models [66.36240676392502]
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems.<n>Recent studies reveal a sharp performance drop in reasoning hop generalization scenarios.<n>We propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process.
arXiv Detail & Related papers (2026-01-29T03:24:32Z) - Disentangling Learning from Judgment: Representation Learning for Open Response Analytics [0.0]
Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade.<n>We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics.
arXiv Detail & Related papers (2025-12-30T02:06:28Z) - Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information [41.10866361182172]
Focused Chain-of-Thought (F-CoT) separates information extraction from the reasoning process.<n>On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT.
arXiv Detail & Related papers (2025-11-27T07:31:52Z) - Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement [44.654178762186824]
Large language models (LLMs) often generate self-contradictory outputs.<n>Video-language models (Video-LLMs) fail to provide consistent responses to logically rephrased questions.<n>We propose an attention enhancement method called Temporally Conditioned Attention Sharpening.
arXiv Detail & Related papers (2025-10-09T12:22:06Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models.<n>We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs [16.659986373052217]
Chain-of-thought reasoning can significantly degrade instruction-following accuracy.<n>This is the first work to systematically expose reasoning-induced failures in instruction-following.
arXiv Detail & Related papers (2025-05-16T16:36:00Z) - Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs)<n>We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs.<n>We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z) - Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs)<n>In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.<n>Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z) - Towards Context-Aware Emotion Recognition Debiasing from a Causal Demystification Perspective via De-confounded Training [14.450673163785094]
Context-Aware Emotion Recognition (CAER) provides valuable semantic cues for recognizing the emotions of target persons.
Current approaches invariably focus on designing sophisticated structures to extract perceptually critical representations from contexts.
We present a Contextual Causal Intervention Module (CCIM) to de-confound the confounder.
arXiv Detail & Related papers (2024-07-06T05:29:02Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers [49.80959223722325]
We study the distinction between feed-forward and attention layers in large language models.<n>We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning.
arXiv Detail & Related papers (2024-06-05T08:51:08Z) - Causal Concept Graph Models: Beyond Causal Opacity in Deep Learning [11.13665894783481]
Causal opacity denotes the difficulty in understanding the "hidden" causal structure underlying the decisions of deep neural network (DNN) models.<n>This work introduces Causal Concept Graph Models (Causal CGMs), a class of interpretable models whose decision-making process is causally transparent by design.<n>Our experiments show that Causal CGMs can: (i) match the generalisation performance of causally opaque models, (ii) enable human-in-the-loop corrections to mispredicted intermediate reasoning steps, and (iii) support the analysis of interventional and counterfactual scenarios.
arXiv Detail & Related papers (2024-05-26T10:15:20Z) - Identifying Semantic Induction Heads to Understand In-Context Learning [103.00463655766066]
We investigate whether attention heads encode two types of relationships between tokens present in natural languages.
We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens.
arXiv Detail & Related papers (2024-02-20T14:43:39Z) - Unveiling the Magic: Investigating Attention Distillation in
Retrieval-augmented Generation [8.363702038073814]
Retrieval-augmented generation framework can address the limitations of large language models by enabling real-time knowledge updates for more accurate answers.
An efficient way in the training phase of retrieval-augmented models is attention distillation, which uses attention scores as a supervision signal instead of manually annotated query-document pairs.
arXiv Detail & Related papers (2024-02-19T02:48:44Z) - Using Early Readouts to Mediate Featural Bias in Distillation [30.5299408494168]
Deep networks tend to learn spurious feature-label correlations in real-world supervised learning tasks.
We propose a novel early readout mechanism whereby we attempt to predict the label using representations from earlier network layers.
arXiv Detail & Related papers (2023-10-28T04:58:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.