LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
- URL: http://arxiv.org/abs/2409.02076v6
- Date: Tue, 8 Oct 2024 01:26:00 GMT
- Title: LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
- Authors: Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
- Abstract summary: Long-form text generation is critical for applications such as design proposals and creative writing.
New long-form text evaluation benchmark, LongGenBench, tests models' ability to identify specific events within generated long text sequences.
- Score: 4.4965596747053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In evaluating the long-context capabilities of large language models (LLMs), benchmarks such as "Needle-in-a-Haystack" (NIAH), Ruler, and Needlebench are commonly used. While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, LongGenBench, which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the LongGenBench, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
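As a rough illustration of the evaluation idea in the abstract (generate long text that must contain particular events, then check whether they were incorporated), the following is a minimal sketch. It is not the authors' metric or code: the `Constraint` class, the `completion_rate` function, the keyword-matching rule, and the toy diary data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    """Hypothetical requirement: `keyword` must appear in section `section`."""
    section: int
    keyword: str


def completion_rate(sections: list[str], constraints: list[Constraint]) -> float:
    """Return the fraction of constraints whose keyword appears in the required section."""
    if not constraints:
        return 1.0  # nothing was required, so nothing was missed
    met = 0
    for c in constraints:
        in_range = 0 <= c.section < len(sections)
        # Case-insensitive substring match; an assumed, deliberately simple check.
        if in_range and c.keyword.lower() in sections[c.section].lower():
            met += 1
    return met / len(constraints)


# Toy "generated" diary with three weekly entries (made-up data).
diary = [
    "Week 1: Started the new project.",
    "Week 2: Attended a birthday party on Saturday.",
    "Week 3: Quiet week, mostly reading.",
]
required = [
    Constraint(section=1, keyword="birthday party"),
    Constraint(section=2, keyword="dentist"),
]
rate = completion_rate(diary, required)
print(rate)
```

A substring match is only a stand-in here; a real benchmark would likely need a more robust judgment of whether an event was actually described.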
Related papers
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT
arXiv Detail & Related papers (2025-01-09T18:16:55Z)
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- LongGenBench: Long-context Generation Benchmark [20.531098579781162]
Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text.
We introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths.
LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating a single, cohesive long-context answer.
arXiv Detail & Related papers (2024-10-05T15:33:25Z)
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [89.28591263741973]
We introduce the Hierarchical Long Text Generation Benchmark (HelloBench) to evaluate Large Language Models' performance in generating long text.
Based on Bloom's taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and text generation.
We also propose Hierarchical Long Text Evaluation (HelloEval), a human evaluation method that significantly reduces the time and effort required for human evaluation.
arXiv Detail & Related papers (2024-09-24T15:38:11Z)
- LongLaMP: A Benchmark for Personalized Long-form Text Generation [87.41296912519992]
We develop the Long-text Language Model Personalization (LongLaMP) Benchmark.
LongLaMP provides a comprehensive and diverse evaluation framework for personalized long-text generation.
The results highlight the importance of personalization across a wide variety of long-text generation tasks.
arXiv Detail & Related papers (2024-06-27T01:52:05Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.