NoLiMa: Long-Context Evaluation Beyond Literal Matching
- URL: http://arxiv.org/abs/2502.05167v1
- Date: Fri, 07 Feb 2025 18:49:46 GMT
- Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching
- Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
- Abstract summary: Recent large language models (LLMs) support contexts ranging from 128K to 1M tokens.
We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens.
While they perform well in short contexts, performance degrades significantly as context length increases.
- Score: 100.00398424275501
- License:
- Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
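The NIAH setup described in the abstract is straightforward to reproduce in outline. The sketch below is only an illustration of the idea, not the NoLiMa release: the helper names and the needle/question pair are placeholders, and the simple overlap check merely approximates the benchmark's "minimal lexical overlap" constraint.

```python
# Minimal sketch of NIAH-style item construction with a NoLiMa-style overlap check.
# Helper names and the example needle/question below are illustrative, not the
# official NoLiMa data or code.

def lexical_overlap(question: str, needle: str) -> float:
    """Fraction of content words in the question that also appear in the needle."""
    stop = {"the", "a", "an", "of", "in", "to", "at", "is", "was", "has", "been",
            "which", "who", "what"}
    q_words = {w.lower().strip("?.,") for w in question.split()} - stop
    n_words = {w.lower().strip("?.,") for w in needle.split()} - stop
    return len(q_words & n_words) / max(len(q_words), 1)

def build_haystack_item(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    pos = int(depth * len(filler_paragraphs))
    return "\n\n".join(filler_paragraphs[:pos] + [needle] + filler_paragraphs[pos:])

# A NoLiMa-style pair: locating the needle requires the latent association
# "Semper Opera House -> Dresden", not a literal string shared with the question.
needle = "Actually, Yuki lives next to the Semper Opera House."
question = "Which character has been to Dresden?"
assert lexical_overlap(question, needle) == 0.0  # no literal match for the model to exploit
```

In a standard NIAH test the question would share surface words with the needle; NoLiMa's contribution is keeping that overlap near zero so the model must retrieve through a latent association rather than string matching.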
Related papers
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs).
Strikingly, many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance.
We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z) - What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens, identified via a long-short context contrastive method (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-10-31T09:39:28Z) - One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z) - BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack [4.3482088816575155]
We introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in long documents.
BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets.
Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.
arXiv Detail & Related papers (2024-06-14T16:00:29Z) - LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
LingOly is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z) - RULER: What's the Real Context Size of Your Long-Context Language Models? [23.220973811374225]
We create a new benchmark for evaluating long-context language models (LMs).
We evaluate 17 long-context LMs with 13 representative tasks in RULER.
Almost all models exhibit large performance drops as the context length increases.
arXiv Detail & Related papers (2024-04-09T23:41:27Z) - NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z) - $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens [64.08660301017302]
There is currently no standardized benchmark for evaluating long-context capability beyond 100K tokens.
$\infty$Bench is the first benchmark featuring an average data length surpassing 100K tokens.
The results indicate that existing long context LLMs still require significant advancements to effectively process 100K+ context.
arXiv Detail & Related papers (2024-02-21T11:30:29Z) - LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K [46.966627564849944]
This paper introduces LV-Eval, a challenging long-context benchmark with five length levels reaching up to 256k words.
The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, and more objective evaluations.
arXiv Detail & Related papers (2024-02-06T13:11:19Z) - Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
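Regarding the LongPPL entry above: the following is a minimal, self-contained sketch of a long-short context contrastive perplexity. It assumes key tokens are simply those whose log-probability improves by more than a fixed threshold when the full long context is visible; the threshold, function name, and interface are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a long-short contrastive perplexity, loosely following
# the LongPPL idea summarized above. Threshold and interface are assumptions.
import math

def long_short_ppl(logp_long: list[float], logp_short: list[float],
                   threshold: float = 2.0) -> float:
    """Perplexity restricted to 'key' tokens: tokens whose log-probability
    improves by more than `threshold` nats when conditioned on the full long
    context rather than a truncated short context."""
    assert len(logp_long) == len(logp_short)
    key = [lp_long for lp_long, lp_short in zip(logp_long, logp_short)
           if lp_long - lp_short > threshold]
    if not key:
        return float("nan")  # no context-dependent tokens found
    return math.exp(-sum(key) / len(key))
```

In practice, `logp_long` and `logp_short` would come from two forward passes of the same model over the same target tokens, one conditioned on the full document and one on a short truncated window ending at the same position.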
This list is automatically generated from the titles and abstracts of the papers on this site.