NoLiMa: Long-Context Evaluation Beyond Literal Matching
- URL: http://arxiv.org/abs/2502.05167v2
- Date: Wed, 26 Mar 2025 13:23:30 GMT
- Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching
- Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
- Abstract summary: NoLiMa is a benchmark extending the needle-in-a-haystack (NIAH) test. It requires models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular large language models that claim to support contexts of at least 128K tokens.
- Score: 100.00398424275501
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
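To make the NIAH setup and the "minimal lexical overlap" idea concrete, the sketch below shows the general shape of such an evaluation: a single needle is placed at varying depths inside long distractor text, and the model is asked a question that shares no keywords with the needle, so answering requires a latent association (here, knowing that the Semper Opera House is in Dresden). This is a minimal illustration, not the released NoLiMa code; `query_model`, the distractor source, and the needle/question pair are assumed placeholders.

```python
# Minimal, illustrative needle-in-a-haystack probe in the spirit of NoLiMa.
# `query_model` stands in for any chat/completion API; the needle and question
# below are an assumed example of a minimal-lexical-overlap pair.

import random

NEEDLE = "Yuki lives next to the Semper Opera House."   # fact hidden in the haystack
QUESTION = "Which character has been to Dresden?"       # shares no keywords with the needle
GOLD_ANSWER = "Yuki"                                     # requires knowing Semperoper -> Dresden

def build_haystack(distractor_paragraphs, needle, depth_ratio):
    """Shuffle distractor paragraphs and insert the needle at a relative depth."""
    paragraphs = list(distractor_paragraphs)
    random.shuffle(paragraphs)
    paragraphs.insert(int(len(paragraphs) * depth_ratio), needle)
    return "\n\n".join(paragraphs)

def evaluate(query_model, distractor_paragraphs, depths=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Return the fraction of needle depths at which the model recovers the gold answer."""
    hits = 0
    for depth in depths:
        context = build_haystack(distractor_paragraphs, NEEDLE, depth)
        prompt = f"{context}\n\nQuestion: {QUESTION}\nAnswer:"
        answer = query_model(prompt)                     # user-supplied model call
        hits += int(GOLD_ANSWER.lower() in answer.lower())
    return hits / len(depths)
```

Sweeping the amount of distractor text passed to `build_haystack` reproduces the context-length axis along which the abstract reports the performance drops.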
Related papers
- Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks [22.859955360764275]
MLRBench is a synthetic benchmark for multilingual long-context reasoning.
It is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths.
arXiv Detail & Related papers (2025-04-17T11:02:35Z)
- Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts [23.076558892985986]
We introduce Sequential-NIAH, a benchmark designed to evaluate the capability of large language models to extract sequential information from long contexts.
The benchmark includes contexts ranging from 8K to 128K tokens in length, with a dataset of 14,000 samples (2,000 reserved for testing).
arXiv Detail & Related papers (2025-04-07T03:50:12Z)
- Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration [4.7429246847107835]
We introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs.
Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens.
We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark.
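As a rough illustration of the data-preparation step described above, the sketch below re-encodes a training document with an artificial pause token inserted at a fixed interval; the token string `<pause>`, the interval of 128, and the use of a Hugging Face tokenizer are assumptions for illustration, not the paper's actual recipe.

```python
# Illustrative sketch only: insert an artificial pause token every `interval`
# tokens so that subsequent fine-tuning can redistribute attention over long
# inputs. Token string and interval are assumed values.

from transformers import AutoTokenizer

def insert_pause_tokens(text, tokenizer, pause_token="<pause>", interval=128):
    """Return token ids for `text` with a pause token after every `interval` tokens."""
    pause_id = tokenizer.convert_tokens_to_ids(pause_token)
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    augmented = []
    for i, tok in enumerate(ids, start=1):
        augmented.append(tok)
        if i % interval == 0:
            augmented.append(pause_id)
    return augmented

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in tokenizer
tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
example_ids = insert_pause_tokens("a long training document ...", tokenizer)
# Before fine-tuning, the model's embedding matrix must cover the new token,
# e.g. model.resize_token_embeddings(len(tokenizer)).
```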
arXiv Detail & Related papers (2025-02-01T21:47:15Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
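For readers who want the general shape of such a metric, one plausible formalization is sketched below; the exact key-token criterion and threshold are defined in the paper itself, so this should be read as an assumed form rather than the paper's definition. A token counts as a key token when its log-probability gains markedly from seeing the long context rather than a truncated short one, and perplexity is then computed over those tokens only.

```latex
% One plausible formalization (assumed form, not necessarily the paper's exact definition).
% x_i ranges over tokens; c_long is the full context, c_short a truncated one; tau is a threshold.
\[
  \mathrm{gain}(x_i) = \log p_\theta\!\left(x_i \mid c_{\text{long}}\right)
                     - \log p_\theta\!\left(x_i \mid c_{\text{short}}\right),
  \qquad
  \mathcal{K} = \{\, i : \mathrm{gain}(x_i) > \tau \,\}
\]
\[
  \mathrm{LongPPL} = \exp\!\left(-\frac{1}{|\mathcal{K}|}
      \sum_{i \in \mathcal{K}} \log p_\theta\!\left(x_i \mid c_{\text{long}}\right)\right)
\]
```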
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z)
- BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack [4.3482088816575155]
We introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in long documents.
BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets.
Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.
arXiv Detail & Related papers (2024-06-14T16:00:29Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
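The parameter-efficient finetuning half of that recipe is standard LoRA; below is a minimal sketch using the Hugging Face `peft` library, with an assumed rank and assumed attention-projection targets. LLoCO's offline context-compression stage is the paper's own component and is not reproduced here.

```python
# Minimal LoRA setup with the Hugging Face `peft` library (sketch only).
# Rank, alpha, and target modules are illustrative defaults; the LLaMA2-7B
# checkpoint named here is the model family the abstract mentions and may be gated.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```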
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- RULER: What's the Real Context Size of Your Long-Context Language Models? [23.220973811374225]
We create RULER, a new benchmark for evaluating long-context language models (LMs).
We evaluate 17 long-context LMs with 13 representative tasks in RULER.
Almost all models exhibit large performance drops as the context length increases.
arXiv Detail & Related papers (2024-04-09T23:41:27Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- $\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens [64.08660301017302]
There is currently a lack of a standardized benchmark for evaluating long-context capability at the 100K+ token scale.
$\infty$Bench is the first benchmark featuring an average data length surpassing 100K tokens.
The results indicate that existing long-context LLMs still require significant advancements to effectively process contexts of 100K+ tokens.
arXiv Detail & Related papers (2024-02-21T11:30:29Z)