On Many-Shot In-Context Learning for Long-Context Evaluation
- URL: http://arxiv.org/abs/2411.07130v2
- Date: Wed, 26 Feb 2025 06:40:58 GMT
- Title: On Many-Shot In-Context Learning for Long-Context Evaluation
- Authors: Kaijian Zou, Muhammad Khalifa, Lu Wang
- Abstract summary: This paper delves into long-context language model evaluation through many-shot ICL. We develop metrics to categorize ICL tasks into two groups: similar-sample learning (SSL) and all-sample learning (ASL). We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
- Score: 10.500629810624769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark, MANYICLBENCH, to characterize models' abilities on both fronts and benchmark 12 LCLMs using MANYICLBENCH. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
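The abstract does not spell out the categorization metrics, but the SSL/ASL split can be pictured as comparing a model's accuracy when each query sees only its most similar retrieved demonstrations against its accuracy with the full many-shot prompt. The following is a minimal sketch of that idea, not the authors' implementation: the cosine-similarity retriever, the tolerance `tol`, and the `eval_accuracy` callback are illustrative assumptions.

```python
# Hypothetical sketch of an SSL-vs-ASL task split. All names and thresholds
# here are illustrative assumptions, not the paper's exact procedure.
import numpy as np

def top_k_similar(query_vec, demo_vecs, k):
    """Indices of the k demonstrations most similar to the query (cosine)."""
    sims = demo_vecs @ query_vec / (
        np.linalg.norm(demo_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]

def classify_task(eval_accuracy, demos, demo_vecs, query_vecs, k=8, tol=0.02):
    """eval_accuracy(demo_sets) -> float runs the model with the given
    per-query demonstration lists and returns task accuracy.

    Label a task SSL when prompting with only the k most similar
    demonstrations per query recovers (nearly) full many-shot accuracy,
    and ASL when it does not.
    """
    # Accuracy with every demonstration in the prompt (full many-shot ICL).
    acc_all = eval_accuracy([demos] * len(query_vecs))
    # Accuracy with only the most similar demonstrations for each query.
    retrieved = [[demos[i] for i in top_k_similar(q, demo_vecs, k)]
                 for q in query_vecs]
    acc_topk = eval_accuracy(retrieved)
    return "SSL" if acc_topk >= acc_all - tol else "ASL"
```

Under this reading, a task where the retrieval-only accuracy tracks the full-prompt accuracy is SSL; one that degrades without the full demonstration pool is ASL.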
Related papers
- Holistic Reasoning with Long-Context LMs: A Benchmark for Database Operations on Massive Textual Data [6.195658947075431]
We introduce HoloBench, a framework that brings database reasoning operations into text-based contexts.
We show that the amount of information in the context has a bigger influence on LCLM performance than the context length.
We find that tasks requiring the aggregation of multiple pieces of information show a noticeable drop in accuracy as context length increases.
arXiv Detail & Related papers (2024-10-15T19:04:13Z) - Instruction Tuning Vs. In-Context Learning: Revisiting Large Language Models in Few-Shot Computational Social Science [0.1499944454332829]
We evaluate the classification performance of large language models (LLMs) using in-context learning (ICL) and instruction tuning (IT).
ICL offers a rapid alternative for task adaptation by learning from examples without explicit gradient updates.
Our research highlights the significant advantages of ICL in handling CSS tasks in few-shot settings.
arXiv Detail & Related papers (2024-09-23T02:43:08Z) - DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [89.51834016940153]
We introduce DetectiveQA, a narrative reasoning benchmark with an average context length of over 100K tokens.
We use detective novels as data sources, which naturally have various reasoning elements.
We manually annotate 600 questions in Chinese and provide an English edition of the context information and questions.
arXiv Detail & Related papers (2024-09-04T06:28:22Z) - Large Language Models Know What Makes Exemplary Contexts [42.90814615222177]
In-context learning (ICL) has proven to be a significant capability with the advancement of large language models (LLMs).
This paper presents a unified framework for LLMs that allows them to self-select influential in-context examples to compose their contexts.
arXiv Detail & Related papers (2024-08-14T12:32:41Z) - NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? [37.64593022203498]
NeedleBench is a framework consisting of progressively more challenging tasks for assessing bilingual long-context capabilities.
We use the framework to assess how well the leading open-source models can identify key information relevant to the question.
We propose the Ancestral Trace Challenge to mimic the complexity of logical reasoning challenges that are likely to be present in real-world long-context tasks.
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations [44.24067814871803]
In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in long contexts.
We propose to automatically generate few-shot examples for long context QA tasks by recycling contexts.
arXiv Detail & Related papers (2024-06-19T15:28:29Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - Implicit In-context Learning [37.0562059811099]
We introduce Implicit In-context Learning (I2CL), an innovative paradigm that reduces the inference cost of ICL to that of zero-shot learning with minimal information loss.
I2CL achieves few-shot level performance at zero-shot inference cost, and it exhibits robustness against variations in demonstration examples.
arXiv Detail & Related papers (2024-05-23T14:57:52Z) - A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision [74.972172804514]
We introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and produces outputs in a joint embedding space between signed language and spoken language text.
We provide new continuous sign-level annotations for six hours of test videos, which will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z) - Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning [41.606494950216764]
In-context Learning (ICL) has emerged as a powerful capability alongside the development of scaled-up large language models (LLMs).
This paper decomposes the overall performance of ICL into three dimensions: label space, format, and discrimination.
We show that ICL exhibits significant efficacy in regulating the label space and format, which helps LLMs respond to desired label words.
arXiv Detail & Related papers (2024-04-11T08:20:10Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs).
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z) - Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions but struggle with complex ones.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions.
arXiv Detail & Related papers (2023-09-17T04:18:39Z) - Learning to Retrieve In-Context Examples for Large Language Models [69.9707552694766]
Large language models (LLMs) have demonstrated their ability to learn in-context.
The effectiveness of in-context learning is heavily reliant on the quality of the selected examples.
We propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples.
arXiv Detail & Related papers (2023-07-14T05:23:08Z) - Iterative Forward Tuning Boosts In-Context Learning in Language Models [88.25013390669845]
In this study, we introduce a novel two-stage framework to boost in-context learning in large language models (LLMs).
Specifically, our framework delineates the ICL process into two distinct stages: a Deep-Thinking stage and a test stage.
The Deep-Thinking stage incorporates a unique attention mechanism, i.e., iterative enhanced attention, which enables multiple rounds of information accumulation.
arXiv Detail & Related papers (2023-05-22T13:18:17Z) - ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction [56.790794611002106]
Large language models (LLMs) have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning.
We propose a simple but effective in-context learning framework called ICL-D3IE.
Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations; a hedged sketch of this selection idea appears after this list.
arXiv Detail & Related papers (2023-03-09T06:24:50Z)
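The ICL-D3IE summary above mentions selecting difficult, distinct segments as hard demonstrations. A minimal sketch of one way such selection could work, assuming per-segment difficulty scores from a base model and per-segment embeddings; the greedy distinctness filter and the thresholds are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of hard-demonstration selection: rank segments by how badly
# a base model handles them (difficulty), then greedily keep segments that
# are far apart in embedding space (distinctness). Illustrative only.
import numpy as np

def select_hard_demos(losses, seg_vecs, n_demos=8, min_dist=0.5):
    """losses: per-segment error of a base model (higher = harder).
    seg_vecs: one embedding vector per segment (rows of a 2-D array).
    Returns indices of the selected hard, mutually distinct segments."""
    order = np.argsort(-np.asarray(losses))  # hardest segments first
    chosen = []
    for i in order:
        v = seg_vecs[i] / (np.linalg.norm(seg_vecs[i]) + 1e-9)
        # Keep a segment only if it is distinct from everything chosen so far.
        if all(np.dot(v, seg_vecs[j] / (np.linalg.norm(seg_vecs[j]) + 1e-9))
               < 1.0 - min_dist for j in chosen):
            chosen.append(i)
        if len(chosen) == n_demos:
            break
    return chosen
```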