BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
- URL: http://arxiv.org/abs/2406.10149v2
- Date: Wed, 06 Nov 2024 14:50:40 GMT
- Title: BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
- Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
- Abstract summary: We introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in long documents.
BABILong comprises a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets.
Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.
- Score: 4.3482088816575155
- Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers after fine-tuning, enabling the processing of lengths up to 50 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 10 million token lengths.
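The construction behind the benchmark is easy to picture in code. Below is a minimal sketch of how a "reasoning-in-a-haystack" sample of a chosen length might be assembled, in the spirit of the abstract: task facts are scattered at random positions inside long background text, followed by the question. The `build_sample` helper, the whitespace token count, and the toy filler sentences are illustrative assumptions, not the benchmark's released code.

```python
# A minimal sketch of how a BABILong-style "reasoning-in-a-haystack" sample
# of a chosen length can be assembled: task facts are scattered at random
# positions inside long background text, followed by the question.
# `build_sample`, the whitespace token count, and the toy filler sentences
# are illustrative assumptions, not the benchmark's released code.
import random

def build_sample(facts, question, background_sentences, target_len_tokens):
    """Interleave task facts into background text up to a target length."""
    # Accumulate background sentences until the token budget is reached.
    haystack, n_tokens = [], 0
    for sentence in background_sentences:
        if n_tokens >= target_len_tokens:
            break
        haystack.append(sentence)
        n_tokens += len(sentence.split())  # crude whitespace token count

    # Place each fact at a random position while keeping the facts in
    # their original relative order (order can matter for the task).
    positions = sorted(random.sample(range(len(haystack) + 1), len(facts)))
    for already_inserted, (pos, fact) in enumerate(zip(positions, facts)):
        haystack.insert(pos + already_inserted, fact)

    return " ".join(haystack) + f"\nQuestion: {question}\nAnswer:"

# Toy bAbI-like single-supporting-fact example:
facts = ["Mary moved to the bathroom.", "John went to the hallway."]
question = "Where is Mary?"
background = [f"Filler sentence number {i}." for i in range(10_000)]
prompt = build_sample(facts, question, background, target_len_tokens=2_000)
```

Generating such prompts at increasing target lengths and scoring the answers is what exposes the gap between advertised and effectively used context that the abstract reports.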
Related papers
- Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? [36.83397306207386]
We evaluate the capabilities of 17 leading Large Language Models (LLMs).
Strikingly, many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance.
We find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows.
arXiv Detail & Related papers (2024-11-07T18:59:27Z)
- LongGenBench: Long-context Generation Benchmark [20.531098579781162]
Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text over long output spans.
We introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths.
LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating a single, cohesive long-context answer.
arXiv Detail & Related papers (2024-10-05T15:33:25Z)
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)
- RULER: What's the Real Context Size of Your Long-Context Language Models? [23.220973811374225]
We create a new benchmark for evaluating long-context language models (LMs).
We evaluate 17 long-context LMs with 13 representative tasks in RULER.
Almost all models exhibit large performance drops as the context length increases.
arXiv Detail & Related papers (2024-04-09T23:41:27Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs).
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been made to expand the context window to accommodate up to 200K input tokens.
We introduce XL$^2$Bench, a benchmark for extremely long context understanding with long-range dependencies.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- LongAlign: A Recipe for Long Context Alignment of Large Language Models [61.85923382850057]
LongAlign is a recipe covering instruction data, training, and evaluation for long context alignment.
We construct a long instruction-following dataset using Self-Instruct.
We adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions.
arXiv Detail & Related papers (2024-01-31T18:29:39Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved impressive proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
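Several of the benchmarks above (Needle Threading, RULER, and BABILong itself) share the same measurement pattern: sweep prompt length and watch where accuracy falls off. Below is a minimal sketch of that sweep, not any paper's actual harness; `make_sample` and `model_answer` are hypothetical stand-ins for a sample constructor and a model call.

```python
# A minimal sketch of the length sweep used to estimate a model's effective
# context: accuracy is measured at growing prompt lengths, and the effective
# limit is the largest tested length whose accuracy stays above a chosen
# threshold. `make_sample` and `model_answer` are hypothetical stand-ins.
from typing import Callable, Iterable, Tuple

def effective_context(
    lengths: Iterable[int],
    samples_per_length: int,
    make_sample: Callable[[int], Tuple[str, str]],  # length -> (prompt, gold)
    model_answer: Callable[[str], str],             # prompt -> model's answer
    threshold: float = 0.85,
) -> int:
    """Return the largest tested length whose accuracy meets the threshold."""
    best = 0
    for length in sorted(lengths):
        correct = 0
        for _ in range(samples_per_length):
            prompt, gold = make_sample(length)
            if gold.lower() in model_answer(prompt).lower():
                correct += 1
        if correct / samples_per_length >= threshold:
            best = length
        # No early exit: accuracy is not guaranteed to be monotone in length.
    return best

# Example sweep over lengths such as 4K, 16K, 64K, and 128K tokens:
# effective_context([4_000, 16_000, 64_000, 128_000], samples_per_length=50,
#                   make_sample=..., model_answer=...)
```

A model that advertises a 128K window but only clears the threshold up to roughly 16K in such a sweep is effectively using about an eighth of its context, which is how utilization figures like the 10-20% reported above arise.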