Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
- URL: http://arxiv.org/abs/2504.04713v3
- Date: Sat, 20 Sep 2025 12:21:08 GMT
- Title: Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts
- Authors: Yifei Yu, Qian-Wen Zhang, Lingfeng Qiao, Di Yin, Fang Li, Jie Wang, Zengxi Chen, Suncong Zheng, Xiaolong Liang, Xing Sun
- Abstract summary: We introduce Sequential-NIAH, a benchmark designed to evaluate the capability of large language models to extract sequential information from long contexts. The benchmark includes three needle generation pipelines: synthetic-temporal, real-temporal, and real-logical orders, with context lengths ranging from 8K to 128K. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.50% on the test set of this benchmark.
- Score: 20.901983944214532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the ability of large language models (LLMs) to process lengthy contexts is critical, especially for retrieving query-relevant information embedded within them. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as \emph{needles}) from long contexts. The benchmark includes three needle generation pipelines (synthetic-temporal, real-temporal, and real-logical orders) with context lengths ranging from 8K to 128K, and comprises 14,000 samples (2,000 for testing). To facilitate evaluation on this benchmark, we trained an evaluation model that assesses the correctness of LLM responses by comparing their completeness and sequential consistency against the ground truth, which provides a more reliable evaluation metric than GPT-4 or Claude. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.50% on the test set of this benchmark. Further analysis highlights the growing challenges posed by increasing the context length or the number of needles, underscoring substantial room for improvement of LLMs. Additionally, noise analysis validates the reliability and challenge of the benchmark, making Sequential-NIAH an important reference for advancing research on the long-text information extraction capabilities of LLMs.
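As a concrete illustration of the evaluation criterion described in the abstract (completeness plus sequential consistency against ground-truth needles), the sketch below shows one way such a check could be approximated. Note that the paper trains a dedicated evaluation model rather than using a rule-based check; this is only a minimal, hypothetical Python approximation based on substring matching, and the function and variable names are assumptions, not the authors' code.

```python
# Hypothetical sketch: score a model response against ground-truth needles by
# (1) completeness: how many needles appear in the response, and
# (2) sequential consistency: whether they appear in the ground-truth order.
# This substring-matching heuristic only approximates the criteria described
# in the paper; the benchmark itself uses a trained evaluation model.
from typing import Dict, List


def evaluate_sequential_response(response: str, needles: List[str]) -> Dict[str, object]:
    # First occurrence of each needle in the response (-1 if absent).
    positions = [response.find(n) for n in needles]
    found = [p for p in positions if p != -1]

    completeness = len(found) / len(needles) if needles else 1.0
    # Order is consistent only if every needle is present and their
    # positions are strictly increasing (i.e., ground-truth order is kept).
    sequential_consistency = (
        len(found) == len(needles)
        and all(a < b for a, b in zip(positions, positions[1:]))
    )
    return {
        "completeness": completeness,
        "sequential_consistency": sequential_consistency,
        # A response counts as correct only if it is both complete and ordered.
        "correct": sequential_consistency,
    }


if __name__ == "__main__":
    needles = [
        "Event A happened in 1990.",
        "Event B followed in 1995.",
        "Event C concluded the series in 2001.",
    ]
    answer = (
        "First, Event A happened in 1990. Then Event B followed in 1995. "
        "Finally, Event C concluded the series in 2001."
    )
    print(evaluate_sequential_response(answer, needles))
```

In practice, exact substring matching would be far too brittle for free-form LLM answers (paraphrases, reordering within a sentence, partial matches), which is presumably why the authors rely on a trained evaluation model instead of a rule like this.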
Related papers
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - NoLiMa: Long-Context Evaluation Beyond Literal Matching [100.00398424275501]
NoLiMa is a benchmark extending the needle-in-a-haystack (NIAH) test.
It requires models to infer latent associations to locate the needle within the haystack.
We evaluate 12 popular large language models that claim to support contexts of at least 128K tokens.
arXiv Detail & Related papers (2025-02-07T18:49:46Z) - LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K.
arXiv Detail & Related papers (2025-01-09T18:16:55Z) - HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We introduce HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories. We find that synthetic tasks like NIAH do not reliably predict downstream performance. While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z) - LongIns: A Challenging Long-context Instruction-based Exam for LLMs [20.8470269815788]
Long-context capabilities of large language models (LLMs) have been a hot topic in recent years. We propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs.
arXiv Detail & Related papers (2024-06-25T14:31:26Z) - One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval than on those that require global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z) - RULER: What's the Real Context Size of Your Long-Context Language Models? [23.220973811374225]
We create RULER, a new benchmark for evaluating long-context language models (LMs).
We evaluate 17 long-context LMs with 13 representative tasks in RULER.
Almost all models exhibit large performance drops as the context length increases.
arXiv Detail & Related papers (2024-04-09T23:41:27Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models [25.74741863885925]
We propose a new benchmark for long-context LLMs focused on a practical meeting assistant scenario.
Our benchmark, ELITR-Bench, augments the existing ELITR corpus by adding 271 manually crafted questions with their ground-truth answers.
Our experiments with 12 long-context LLMs on ELITR-Bench confirm the progress made across successive generations of both proprietary and open models.
arXiv Detail & Related papers (2024-03-29T16:13:31Z) - LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
arXiv Detail & Related papers (2024-01-25T19:57:00Z) - Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)