LooGLE: Can Long-Context Language Models Understand Long Contexts?
- URL: http://arxiv.org/abs/2311.04939v1
- Date: Wed, 8 Nov 2023 01:45:37 GMT
- Title: LooGLE: Can Long-Context Language Models Understand Long Contexts?
- Authors: Jiaqi Li, Mengmeng Wang, Zilong Zheng, Muhan Zhang
- Abstract summary: LooGLE is a benchmark for large language models' long context understanding.
It features relatively new documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning diverse domains.
The evaluation of eight state-of-the-art LLMs on LooGLE revealed key findings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs), despite their impressive performance in various
language tasks, are typically limited to processing texts within their
context-window size. This limitation has spurred significant research efforts to enhance LLMs'
long-context understanding with high-quality long-sequence benchmarks. However,
prior datasets in this regard suffer from shortcomings, such as short context
length compared to the context window of modern LLMs; outdated documents that
have data leakage problems; and an emphasis on short dependency tasks rather
than long dependency tasks. In this paper, we present LooGLE, a Long Context
Generic Language Evaluation benchmark for LLMs' long context understanding.
LooGLE features relatively new documents post-2022, with over 24,000 tokens per
document and 6,000 newly generated questions spanning diverse domains. Human
annotators meticulously crafted more than 1,100 high-quality question-answer
pairs to meet the long dependency requirements. These pairs underwent thorough
cross-validation, yielding the most precise assessment of LLMs' long dependency
capabilities. The evaluation of eight state-of-the-art LLMs on LooGLE revealed
key findings: (i) commercial models outperformed open-source models; (ii) LLMs
excelled in short dependency tasks like short question-answering and cloze
tasks but struggled with more intricate long dependency tasks; (iii) in-context
learning and chain-of-thought prompting offered only marginal improvements; (iv)
retrieval-based techniques demonstrated substantial benefits for short
question-answering, while strategies for extending context window length had
limited impact on long context understanding. As such, LooGLE not only provides
a systematic and comprehensive evaluation schema on long-context LLMs, but also
sheds light on future development of enhanced models towards "true long-context
understanding".
Related papers
- LongIns: A Challenging Long-context Instruction-based Exam for LLMs (2024-06-25)
  Long-context capabilities of large language models (LLMs) have been a hot topic in recent years. We propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs.
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA (2024-06-25)
  Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows. We propose Loong, a novel long-context benchmark that aligns with realistic scenarios through extended multi-document question answering (QA). Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks (2024-04-09)
  We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs). Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies (2024-04-08)
  Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context-window sizes. Various approaches have been proposed to expand the context window to accommodate up to 200K input tokens. We introduce XL$^2$Bench, a benchmark for extremely long context understanding with long-range dependencies.
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens (2024-03-18)
  NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts. This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types. Our evaluation of long-context LLMs on NovelQA reveals significant insights into the models' performance.
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (2023-08-28)
  LongBench is the first bilingual, multi-task benchmark for long context understanding. It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models (2023-07-20)
  We propose L-Eval to institute a more standardized evaluation for long-context language models (LCLMs). We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.