100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
- URL: http://arxiv.org/abs/2505.19293v1
- Date: Sun, 25 May 2025 19:58:31 GMT
- Title: 100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
- Authors: Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
- Abstract summary: Real-task-based long-context evaluation benchmarks have two major shortcomings. Benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability. We introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities.
- Score: 28.694112253150983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM lets users effortlessly handle many otherwise exhausting tasks -- e.g., digesting a long-form document to find answers versus simply asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
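The abstract does not spell out the proposed metric, but the underlying idea (scoring a model's long-context performance relative to its own short-context baseline, so cross-model comparisons are not confounded by differing base knowledge) can be sketched in a few lines. The sketch below is a hypothetical illustration of that idea only; the function name, the retention-ratio formulation, and the length grid are assumptions, not the paper's actual metric.

```python
# Hypothetical sketch: normalize long-context accuracy by the model's own
# short-context (baseline) accuracy, so the result reflects long-context
# ability rather than baseline knowledge. Names and defaults are illustrative.
from typing import Callable, Dict, Sequence


def long_context_retention(
    score_fn: Callable[[int], float],          # task accuracy at a given context length
    baseline_length: int = 1_000,              # short context used to estimate baseline ability
    long_lengths: Sequence[int] = (8_000, 32_000, 128_000),
) -> Dict[int, float]:
    """For each long length, return accuracy divided by baseline accuracy."""
    baseline = score_fn(baseline_length)
    if baseline == 0:
        raise ValueError("Baseline accuracy is zero; retention is undefined.")
    return {length: score_fn(length) / baseline for length in long_lengths}


if __name__ == "__main__":
    # Toy model whose accuracy decays as the context grows.
    fake_scores = {1_000: 0.80, 8_000: 0.72, 32_000: 0.56, 128_000: 0.32}
    retention = long_context_retention(lambda n: fake_scores[n])
    for length, r in retention.items():
        print(f"{length:>7} tokens: retention = {r:.2f}")
```

Reporting such a ratio at several lengths would also expose where a model begins to break down, the second gap the abstract highlights.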
Related papers
- Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models [36.69535336525585]
Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Long-context referencing is a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data. This paper proposes Ref-Long, a novel benchmark designed to assess the long-context referencing capability of LCLMs.
arXiv Detail & Related papers (2025-07-13T06:17:53Z)
- MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models [52.60063131713119]
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). Existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs. We propose a concise data compression method tailored for long-text data with sparse information characteristics.
arXiv Detail & Related papers (2025-05-26T13:21:18Z)
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion [20.293369733522983]
LongReason is a synthetic benchmark for evaluating the long-context reasoning capabilities of large language models. It consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories. We evaluate 21 LLMs on LongReason, revealing that most models experience significant performance drops as context length increases.
arXiv Detail & Related papers (2025-01-25T05:32:14Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. Perplexity (PPL) has proven unreliable for assessing long-context capabilities. We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)