Related papers: LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

URL: http://arxiv.org/abs/2501.05414v2
Date: Mon, 21 Apr 2025 13:50:28 GMT
Title: LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Authors: Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen,
Abstract summary: LongProc (Long Procedural Generation) is a new benchmark for evaluating long-context language models (LCLMs)<n>LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans.<n>We evaluate 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K.
Score: 74.89981179257194
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluated 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models achieve stronger overall performance in long-form generation, benefiting from long CoT training. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc.

Related papers

Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels [3.537369004801589]
We release the Too Long, Didn't Model benchmark.<n>It tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time.<n>We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens.
arXiv Detail & Related papers (2025-05-20T21:21:09Z)
How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We find that code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data. Our final model, ProLong-8B, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs [4.4965596747053]
LongGenBench is a novel benchmark designed to rigorously evaluate large language models' ability to generate long text. It evaluates model performance across four distinct scenarios, three instruction types, and two generation-lengths (16K and 32K tokens) Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench.
arXiv Detail & Related papers (2024-09-03T17:25:54Z)
ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities [53.97515452727115]
ChatQA 2 is a Llama 3.0-based model with a 128K context window. We present a training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens. We find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks.
arXiv Detail & Related papers (2024-07-19T17:35:47Z)
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications. We propose a data mining framework textbfProLong that can assign each training sample with a long dependency score. Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs) Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding. It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese)
arXiv Detail & Related papers (2023-08-28T11:53:40Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.