LongGenBench: Long-context Generation Benchmark
- URL: http://arxiv.org/abs/2410.04199v3
- Date: Thu, 24 Oct 2024 14:43:22 GMT
- Title: LongGenBench: Long-context Generation Benchmark
- Authors: Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu,
- Abstract summary: Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text.
We introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths.
LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating a single, cohesive long-context answer.
- Score: 20.531098579781162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API accessed and open source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API accessed models, and the Qwen2 series exhibiting the least degradation in LongGenBench among open source models.
Related papers
- HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We present HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.
We find that synthetic tasks like NIAH are not good predictors of downstream performance.
While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z) - LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs [4.4965596747053]
Long-form text generation is critical for applications such as design proposals and creative writing.
New long-form text evaluation benchmark, LongGenBench, tests models' ability to identify specific events within generated long text sequences.
arXiv Detail & Related papers (2024-09-03T17:25:54Z) - Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA [71.04146366608904]
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows.
We propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA)
Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z) - LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves it's best performance and comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs)
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z) - CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models [45.892014195594314]
We present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs.
CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating to models with context windows size from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
arXiv Detail & Related papers (2024-03-06T07:43:43Z) - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese)
arXiv Detail & Related papers (2023-08-28T11:53:40Z) - LOT: A Benchmark for Evaluating Chinese Long Text Understanding and
Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.