LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
- URL: http://arxiv.org/abs/2409.02076v7
- Date: Thu, 23 Jan 2025 00:52:08 GMT
- Title: LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
- Authors: Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
- Abstract summary: LongGenBench is a novel benchmark designed to rigorously evaluate large language models' ability to generate long text.
It evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens).
Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench.
- Score: 4.4965596747053
- Abstract: Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and NeedleBench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences - a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench, particularly as text length increased. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation.
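The evaluation the abstract describes hinges on checking whether required events or constraints actually appear at designated points inside one very long generation. The sketch below illustrates that style of check; it is a minimal, assumption-laden illustration (the '## Week n' segment markers, the diary-style scenario, and all function names are hypothetical), not the authors' released evaluation code.

```python
import re

def split_segments(text: str) -> dict[int, str]:
    """Split one long generation into indexed segments.

    Assumes the prompt asked the model to head each diary segment
    with a '## Week <n>' marker (hypothetical format).
    """
    parts = re.split(r"(?m)^##\s*Week\s+(\d+)", text)
    # With one capture group, re.split yields [preamble, idx, seg, idx, seg, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}

def satisfaction_rate(text: str, constraints: dict[int, str]) -> float:
    """Fraction of (segment index -> required keyword) constraints met."""
    segments = split_segments(text)
    hits = sum(
        1 for idx, kw in constraints.items()
        if kw.lower() in segments.get(idx, "").lower()
    )
    return hits / len(constraints)

# Example: the instruction demanded a marathon in week 3 and a wedding in week 40.
output = "## Week 1\n...\n## Week 3\nI ran the marathon...\n## Week 40\nThe wedding was..."
print(satisfaction_rate(output, {3: "marathon", 40: "wedding"}))  # 1.0
```

A checker like this is deliberately string-based: at 16K-32K output tokens, the hard part the benchmark measures is whether the model still honors the instruction that far into its own generation, not whether a judge can parse the answer.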
Related papers
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [74.89981179257194]
LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs).
LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.
We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum output lengths of 500, 2K, and 8K tokens. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks.
arXiv Detail & Related papers (2025-01-09T18:16:55Z)
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation; a rough sketch of this style of iterative lengthening appears after this list.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- LongGenBench: Long-context Generation Benchmark [20.531098579781162]
Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans extended contexts.
We introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths.
LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating a single, cohesive long-context answer.
arXiv Detail & Related papers (2024-10-05T15:33:25Z)
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [89.28591263741973]
We introduce the Hierarchical Long Text Generation Benchmark (HelloBench) to evaluate Large Language Models' performance in generating long text.
Based on Bloom's taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and text generation.
Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human evaluation method that significantly reduces the time and effort required for human evaluation.
arXiv Detail & Related papers (2024-09-24T15:38:11Z)
- LongLaMP: A Benchmark for Personalized Long-form Text Generation [87.41296912519992]
We develop the Long-text Language Model Personalization (LongLaMP) Benchmark.
LongLaMP provides a comprehensive and diverse evaluation framework for personalized long-text generation.
The results highlight the importance of personalization across a wide variety of long-text generation tasks.
arXiv Detail & Related papers (2024-06-27T01:52:05Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various approaches have been proposed to expand the context window, accommodating up to 200K input tokens.
We introduce a benchmark for extremely long context understanding with long-range dependencies, XL$^2$Bench.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types.
Our evaluation of long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly the challenges they face with multi-hop reasoning and detail-oriented questions over extremely long inputs.
arXiv Detail & Related papers (2024-03-18T17:32:32Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
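The Self-Lengthen entry above describes growing long outputs using only the model itself, with no auxiliary data or proprietary models. Below is a rough inference-time sketch of that idea; the split-and-expand loop, the prompt wording, and the `llm` callable are all assumptions, and the paper itself describes an iterative training framework rather than pure prompting.

```python
def self_lengthen(llm, instruction: str, rounds: int = 2) -> str:
    """Iteratively grow a response using only the model itself.

    llm(prompt) -> str is any text-completion callable (hypothetical interface).
    """
    # Round 0: an ordinary response, typically far shorter than the target length.
    draft = llm(f"{instruction}\n\nWrite a complete response.")

    for _ in range(rounds):
        # Split the draft in half and expand each half, roughly doubling
        # total length per round while keeping the halves self-consistent.
        mid = len(draft) // 2
        expanded = [
            llm(
                "Rewrite the following passage so it is about twice as long, "
                "adding detail without contradicting the rest of the text:\n\n" + part
            )
            for part in (draft[:mid], draft[mid:])
        ]
        draft = "\n".join(expanded)
    return draft
```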
This list is automatically generated from the titles and abstracts of the papers in this site.