MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning
- URL: http://arxiv.org/abs/2505.01110v1
- Date: Fri, 02 May 2025 08:45:45 GMT
- Title: MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning
- Authors: Murtadha Ahmed, Wenbo, Liu Yunfeng
- Abstract summary: We introduce Mitigating Attention Dispersion in large-scale ICL (MateICL). We show that MateICL can effectively leverage larger contexts to improve ICL performance. Despite advances in inference strategies, our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in In-Context Learning (ICL). However, the fixed position length constraints in pre-trained models limit the number of demonstration examples. Recent efforts to extend context suffer from attention dispersion as the number of demonstrations increases. In this paper, we introduce Mitigating Attention Dispersion in large-scale ICL (MateICL) that enables LLMs to maintain effective self-attention as the context size grows. We first split the context into multiple windows, each filled to the model's context capacity, which are processed separately. Then, we introduce an additional layer to recalibrate the attention weights, prioritizing the query tokens as the number of demonstrations increases. Our empirical results show that MateICL can effectively leverage larger contexts to improve ICL performance. Compared to retrieval-based baselines, MateICL consistently achieves better performance without requiring an externally trained retrieval model. Despite recent advances in inference strategies (e.g., 32k token contexts), our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings. The code is publicly available at https://github.com/amurtadha/MateICL.
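Below is a minimal, illustrative sketch of the two ideas described in the abstract: demonstrations are packed into separate windows that are encoded independently, and the attention scores on the query's own tokens are boosted as the number of windows grows, so attention mass does not disperse across many demonstrations. The function names, the `gamma` parameter, and the log-scaled bias are our assumptions for exposition only; the paper's actual recalibration layer may differ (see the linked repository for the authors' implementation).

```python
# Sketch of windowed ICL with attention recalibration, in the spirit of MateICL
# as described in the abstract. NOT the authors' implementation: the packing rule,
# the recalibration bias, and gamma are simplifying assumptions for illustration.
import torch
import torch.nn.functional as F

def split_into_windows(demos, window_size):
    """Pack demonstration examples into windows of at most `window_size` items
    (the paper fills each window to the model's token capacity; we count items)."""
    return [demos[i:i + window_size] for i in range(0, len(demos), window_size)]

def recalibrated_attention(q, window_keys, window_values, query_keys, query_values, gamma=1.0):
    """Attend over independently encoded context windows plus the query's own tokens.

    window_keys / window_values: per-window KV caches produced by encoding each
    packed window separately. A bias that grows with the number of windows is
    added to the query-token scores, standing in for the recalibration layer
    that prioritizes query tokens as the demonstration count increases.
    """
    n_windows = len(window_keys)
    d = q.shape[-1]
    window_scores = [q @ k.T / d ** 0.5 for k in window_keys]   # (Lq, Lw_i) per window
    query_scores = q @ query_keys.T / d ** 0.5                  # (Lq, Lq)
    # Recalibration: bias query tokens upward as the number of windows grows.
    query_scores = query_scores + gamma * torch.log(torch.tensor(float(max(n_windows, 1))))
    scores = torch.cat(window_scores + [query_scores], dim=-1)
    attn = F.softmax(scores, dim=-1)
    values = torch.cat(window_values + [query_values], dim=0)
    return attn @ values                                        # (Lq, d)
```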
Related papers
- END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences, which degrades output quality. We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning of the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - LLMs Are In-Context Bandit Reinforcement Learners [30.192422586838997]
Large Language Models (LLMs) excel at in-context learning (ICL), a supervised learning technique that relies on adding annotated examples to the model context. We investigate a contextual bandit version of in-context reinforcement learning (ICRL), where models learn in-context, online, from external reward, instead of supervised data.
arXiv Detail & Related papers (2024-10-07T17:45:00Z) - Are LLMs Good Annotators for Discourse-level Event Relation Extraction? [15.365993658296016]
We assess the effectiveness of Large Language Models (LLMs) in addressing discourse-level event relation extraction tasks. Evaluation is conducted using a commercial model, GPT-3.5, and an open-source model, LLaMA-2.
arXiv Detail & Related papers (2024-07-28T19:27:06Z) - In-Context Learning with Long-Context Models: An In-Depth Exploration [92.16922648612807]
We show that, for many datasets with large label spaces, performance continues to increase with thousands of demonstrations. We show that long-context ICL can be an effective tool, and may not require long-context encoding of the demonstration set at all.
arXiv Detail & Related papers (2024-04-30T21:06:52Z) - ParaICL: Towards Robust Parallel In-Context Learning [74.38022919598443]
Large language models (LLMs) have become the norm in natural language processing.
Few-shot in-context learning (ICL) relies on the choice of few-shot demonstration examples.
We propose a novel method named parallel in-context learning (ParaICL).
arXiv Detail & Related papers (2024-03-31T05:56:15Z) - Naive Bayes-based Context Extension for Large Language Models [2.743675474582704]
We introduce a novel framework called Naive Bayes-based Context Extension (NBCE).
NBCE enables existing Large Language Models (LLMs) to perform In-Context Learning (ICL) with an increased number of demonstrations.
NBCE substantially enhances performance, particularly as the number of demonstration examples increases.
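For context on the name: under a conditional-independence assumption between context windows, the naive Bayes identity offers a way to combine per-window next-token predictions without exceeding the model's context length. The sketch below illustrates that identity only; it is not the paper's implementation, and `nbce_combine` and its tensor layout are our own illustrative choices.

```python
# Naive Bayes combination of per-window next-token predictions (illustrative only).
# Assuming windows S_1..S_n are conditionally independent given the next token T:
#   log p(T | S_1..S_n) = sum_i log p(T | S_i) - (n - 1) * log p(T) + const
import torch
import torch.nn.functional as F

def nbce_combine(window_logits: torch.Tensor, uncond_logits: torch.Tensor) -> torch.Tensor:
    """window_logits: (n_windows, vocab) next-token logits, one row per context window.
    uncond_logits: (vocab,) next-token logits with no context, i.e. the prior p(T)."""
    n = window_logits.shape[0]
    log_probs = F.log_softmax(window_logits, dim=-1)        # log p(T | S_i)
    log_prior = F.log_softmax(uncond_logits, dim=-1)        # log p(T)
    combined = log_probs.sum(dim=0) - (n - 1) * log_prior   # naive Bayes combination
    return F.log_softmax(combined, dim=-1)                  # renormalize over the vocabulary
```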
arXiv Detail & Related papers (2024-03-26T09:59:45Z) - Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning [32.29118942982609]
Large Language Models (LLMs) have recently gained In-Context Learning (ICL) ability as the models scale up.
This paper investigates how to determine approximately optimal weights for demonstration examples and how to apply them during ICL.
Experimental results on 8 text classification tasks show that our approach outperforms conventional ICL by a large margin.
arXiv Detail & Related papers (2023-10-12T13:15:11Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Explaining Emergent In-Context Learning as Kernel Regression [61.57151500616111]
Large language models (LLMs) have initiated a paradigm shift in transfer learning.
In this paper, we investigate the reason why a transformer-based language model can accomplish in-context learning after pre-training.
We find that during ICL, the attention and hidden features in LLMs match the behaviors of a kernel regression.
arXiv Detail & Related papers (2023-05-22T06:45:02Z)
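For readers unfamiliar with the kernel-regression view referenced in the entry above, a Nadaraya-Watson style estimator is the standard form (the notation here is ours, not the paper's): given demonstrations (x_i, y_i) and a similarity kernel K, the prediction for a query x is

$$\hat{y}(x) = \frac{\sum_{i=1}^{n} K(x, x_i)\, y_i}{\sum_{i=1}^{n} K(x, x_i)},$$

and the reading suggested by the entry is that attention weights computed during ICL behave like the normalized kernel weights K(x, x_i) / \sum_j K(x, x_j).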