HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2409.16191v1
- Date: Tue, 24 Sep 2024 15:38:11 GMT
- Title: HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
- Authors: Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen
- Abstract summary: We introduce the Hierarchical Long Text Generation Benchmark (HelloBench) to evaluate Large Language Models' performance in generating long text.
Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation.
In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation.
- Score: 89.28591263741973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. First, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare it with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods; the results show that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.
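The comparison the abstract describes boils down to two measurable checks: whether generated outputs satisfy an explicit length constraint, and how strongly an automatic metric tracks human judgments. The sketch below illustrates both with hypothetical data; the sample texts, score lists, and variable names are placeholders, not the released HelloBench/HelloEval code, and the only dependency assumed is SciPy for the Spearman correlation.

```python
# Minimal sketch (not the HelloBench/HelloEval implementation) of the two checks
# described in the abstract, using hypothetical placeholder data.
from scipy.stats import spearmanr

# Hypothetical model outputs; real HelloBench data comes from the released repo.
generations = [
    "A multi-chapter essay on renewable energy ...",
    "A short product review that stops early ...",
    "A long serialized story with heavy repetition ...",
    "A detailed technical report ...",
]
requested_words = 4000           # explicit length constraint in the instruction
human_scores = [4.5, 2.0, 2.5, 4.0]    # e.g., averaged annotator ratings per sample
metric_scores = [0.42, 0.20, 0.38, 0.35]  # e.g., ROUGE-L or an LLM-as-a-Judge score

# (1) Length-constraint check: how many outputs reach the requested word count?
word_counts = [len(text.split()) for text in generations]
meets_constraint = sum(count >= requested_words for count in word_counts)
print(f"{meets_constraint}/{len(generations)} outputs reach {requested_words} words")

# (2) Metric quality check: correlation of the automatic metric with human scores.
# A higher correlation means the metric is a better stand-in for human evaluation.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human scores: {rho:.3f} (p = {p_value:.3f})")
```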
Related papers
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs [4.4965596747053]
Long-form text generation is critical for applications such as design proposals and creative writing.
LongGenBench, a new long-form text evaluation benchmark, tests models' ability to identify specific events within generated long text sequences.
arXiv Detail & Related papers (2024-09-03T17:25:54Z)
- RepoQA: Evaluating Long Context Code Understanding [12.329233433333416]
RepoQA is a benchmark to evaluate Large Language Models (LLMs) on long-context code understanding.
RepoQA includes 500 code search tasks gathered from 50 popular repositories across 5 modern programming languages.
arXiv Detail & Related papers (2024-06-10T05:15:30Z)
- Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs)
Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities.
We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
- XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies [45.31042312867939]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes.
Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens.
We introduce XL$^2$Bench, a benchmark for extremely long context understanding with long-range dependencies.
arXiv Detail & Related papers (2024-04-08T12:29:07Z)
- BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved strong proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's overall grasp of language, particularly its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z)