A Controllable Examination for Long-Context Language Models
- URL: http://arxiv.org/abs/2506.02921v1
- Date: Tue, 03 Jun 2025 14:23:06 GMT
- Title: A Controllable Examination for Long-Context Language Models
- Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov,
- Abstract summary: This study introduces $\textbf{LongBioBench}$, a novel benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning. LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability.
- Score: 45.47345679278309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches are accompanied by certain intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should be characterized by three essential features: $\textit{seamless context}$, $\textit{controllable setting}$, and $\textit{sound evaluation}$. This study introduces $\textbf{LongBioBench}$, a novel benchmark that utilizes artificially generated biographies as a controlled environment for assessing LCLMs across the dimensions of $\textit{understanding}$, $\textit{reasoning}$, and $\textit{trustworthiness}$. Our experimental evaluation, which includes $\textbf{18}$ LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results, and become less trustworthy as context length increases. Our further analysis indicates that some design choices employed by existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, make them unreliable tests of models' long-context capabilities. Moreover, we reveal that long-context continual pretraining primarily adjusts RoPE embeddings to accommodate extended context lengths. In sum, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and it is highly interpretable and configurable.
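To make the "controllable setting" idea concrete, the sketch below shows one way a biography-style haystack item could be assembled: every passage is a fluent biography, so the needle is stylistically indistinguishable from its surroundings, while context length, distractor count, and question type remain tunable knobs. This is a hypothetical illustration under our own assumptions; the attribute pools, templates, and the `build_item` helper are invented here and are not the authors' released generation pipeline.

```python
import random

# Hypothetical attribute pools; the real LongBioBench generation pipeline
# (fields, templates, sampling) is not specified here and will differ.
NAMES = ["Alice Morton", "Bilal Khan", "Carla Jensen", "Diego Fuentes", "Emi Tanaka"]
CITIES = ["Lisbon", "Osaka", "Tallinn", "Quito", "Windhoek"]
JOBS = ["marine biologist", "archivist", "civil engineer", "pastry chef", "cartographer"]

def make_bio(rng: random.Random, idx: int) -> dict:
    """Sample one synthetic biography as a dict of attributes."""
    return {
        "name": f"{rng.choice(NAMES)} {idx}",  # index suffix keeps names unique
        "city": rng.choice(CITIES),
        "job": rng.choice(JOBS),
        "birth_year": rng.randint(1950, 2000),
    }

def render(bio: dict) -> str:
    """Render a biography as fluent text so the context stays seamless."""
    return (f"{bio['name']} was born in {bio['birth_year']} and lives in "
            f"{bio['city']}, where they work as a {bio['job']}.")

def build_item(num_bios: int = 50, seed: int = 0) -> dict:
    """Build one controllable QA item: context, question, gold answer."""
    rng = random.Random(seed)
    bios = [make_bio(rng, i) for i in range(num_bios)]
    target = rng.choice(bios)  # the "needle" is just one more coherent biography
    context = "\n".join(render(b) for b in bios)  # every other passage acts as a distractor
    question = f"In which city does {target['name']} live?"
    return {"context": context, "question": question, "answer": target["city"]}

if __name__ == "__main__":
    item = build_item(num_bios=5)
    print(item["question"], "->", item["answer"])
```

Scaling `num_bios` stretches the context while holding the task itself fixed, which is the kind of control a benchmark needs to separate length effects from task difficulty.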
Related papers
- 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? [28.694112253150983]
Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
arXiv Detail & Related papers (2025-05-25T19:58:31Z) - LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams [4.917265821383127]
We construct the first spoken long-text dataset, derived from live streams, to reflect the redundancy-rich and conversational nature of real-world scenarios. We evaluate both popular LLMs and specialized methods to assess their ability to understand long contexts in these tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding.
arXiv Detail & Related papers (2025-04-24T08:27:48Z) - Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning [103.65680870130839]
We investigate how to design instruction data for the post-training phase of a long-context pre-trained model. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones. Based on these findings, we propose context synthesis, a novel data synthesis framework.
arXiv Detail & Related papers (2025-02-21T17:02:40Z) - Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We find that models trained on synthetic data fall short of those trained on real data, but, surprisingly, the mismatch can be interpreted. Our results shed light on how to interpret synthetic-data fine-tuning performance and how to create better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z) - ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage [21.036912648701264]
We introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. We present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context.
arXiv Detail & Related papers (2024-10-22T09:35:42Z) - HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We introduce HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories. We find that synthetic tasks like NIAH do not reliably predict downstream performance. While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z) - A Controlled Study on Long Context Extension and Generalization in LLMs [85.4758128256142]
Broad textual understanding and in-context learning require language models that utilize full document contexts.
Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts.
We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data.
arXiv Detail & Related papers (2024-09-18T17:53:17Z) - NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context? [43.98513461616172]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows.
We propose a novel long-context benchmark, Loong, which aligns with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)