LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
- URL: http://arxiv.org/abs/2510.24345v1
- Date: Tue, 28 Oct 2025 12:11:12 GMT
- Title: LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
- Authors: Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
- Abstract summary: We introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 Large Language Models shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
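The abstract describes CoV-Eval only at a high level. As a rough illustration of constraint-verifier scoring, the sketch below checks a long output against deterministic verifier functions; the Constraint class, the example checks, and the sample output are assumptions made for illustration, not LongWeave's actual interface.

```python
# Minimal sketch of constraint-verifier scoring in the spirit of CoV-Eval:
# each task pairs a query with machine-checkable constraints, so long
# outputs can be scored objectively rather than by fuzzy similarity.
# All names and constraints below are illustrative.
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    description: str
    verify: Callable[[str], bool]   # deterministic check over the output text

def constraint_score(output: str, constraints: list[Constraint]) -> float:
    """Fraction of constraints the generated text satisfies."""
    return sum(c.verify(output) for c in constraints) / len(constraints)

# Example task: a long report that must cover a given entity, keep a
# minimum section count, and respect an output-length budget.
constraints = [
    Constraint("mentions Q3 revenue", lambda o: "Q3 revenue" in o),
    Constraint("has at least 3 sections", lambda o: len(re.findall(r"(?m)^## ", o)) >= 3),
    Constraint("stays under 8K words", lambda o: len(o.split()) < 8000),
]

sample_output = "## Overview\nQ3 revenue rose.\n## Risks\n...\n## Outlook\n..."
print(constraint_score(sample_output, constraints))  # 1.0 for this toy output
```

Because every check is deterministic, a score of this kind stays reproducible even as input and output lengths scale up, which is the property the benchmark is after.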
Related papers
- LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
LooGLE v2 is a novel benchmark designed to evaluate large language models' long-context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, spanning the domains of law, finance, games, and code. Evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark.
arXiv Detail & Related papers (2025-10-26T06:14:19Z)
- HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds.
arXiv Detail & Related papers (2025-08-18T09:59:02Z)
- A Controllable Examination for Long-Context Language Models
This study introduces LongBioBench, a benchmark for evaluating long-context language models. We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results. Our further analysis highlights design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z)
- Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
LongRefiner is an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. It achieves competitive performance across various scenarios while incurring 10x lower computational cost and latency than the best baseline.
arXiv Detail & Related papers (2025-05-15T15:34:15Z)
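As a rough illustration of structure-aware refinement (not LongRefiner's actual method), the sketch below splits a document on its headings, scores each section against the query with a toy lexical overlap, and keeps the most relevant sections within a word budget before they reach the generator.

```python
# Hypothetical sketch of hierarchical refinement for long-context RAG:
# use the document's own structure as boundaries, rank sections by a toy
# relevance score, and keep only the best within a context budget.
import re

def split_by_headings(doc: str) -> list[str]:
    """Split on markdown headings so sections follow the document's structure."""
    parts = re.split(r"(?m)^(?=#{1,3} )", doc)
    return [p for p in parts if p.strip()]

def score(section: str, query: str) -> float:
    """Toy lexical overlap; a real system would use a trained scorer."""
    q, s = set(query.lower().split()), set(section.lower().split())
    return len(q & s) / max(len(q), 1)

def refine(doc: str, query: str, budget_words: int = 2000) -> str:
    sections = sorted(split_by_headings(doc), key=lambda s: score(s, query), reverse=True)
    kept, used = [], 0
    for sec in sections:
        n = len(sec.split())
        if used + n <= budget_words:
            kept.append(sec)
            used += n
    return "\n".join(kept)  # compact context handed to the generator
```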
- WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
WildLong extracts meta-information from real user queries to produce realistic long-context instruction data at scale. It supports multi-document reasoning, such as cross-document comparison and aggregation, and surpasses existing open-source long-context-optimized models across benchmarks.
arXiv Detail & Related papers (2025-02-23T18:59:09Z)
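As a loose illustration of query-driven synthesis (the field names and templates below are assumptions, not the paper's pipeline), one can extract coarse meta-information from a real query and instantiate a new multi-document instruction from it:

```python
# Hypothetical sketch of meta-information-driven instruction synthesis,
# loosely in the spirit of WildLong. Fields and templates are illustrative.

def extract_meta(query: str) -> dict:
    """Pull coarse meta-information (here, just a task type) from a real query."""
    task = "comparison" if (" vs " in query or "compare" in query) else "summary"
    return {"task": task, "topic": query}

def synthesize_instruction(meta: dict, n_docs: int = 4) -> str:
    """Instantiate a new multi-document instruction from the meta-information."""
    docs = ", ".join(f"Document {i + 1}" for i in range(n_docs))
    if meta["task"] == "comparison":
        return f"Given {docs}, compare the reported figures across all sources."
    return f"Given {docs}, write a consolidated summary citing each source."

meta = extract_meta("compare 2023 GPU prices vs 2024")
print(synthesize_instruction(meta))
```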
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework. Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
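The counterfactual in the paper's title suggests a simple probe, sketched below under stated assumptions: the context asserts something false, and we check whether the model's answer follows the context rather than its parametric knowledge. `ask_model` is a hypothetical stand-in for any LLM call, not FaithEval's API, and the matching is deliberately simplified.

```python
# Hypothetical sketch of a context-faithfulness probe in the spirit of
# FaithEval: the context contradicts world knowledge, and a faithful
# model should answer from the context anyway.

def ask_model(prompt: str) -> str:
    """Stub LLM call; replace with a real client."""
    return "According to the passage, the Moon is made of marshmallows."

def faithful(answer: str, context_answer: str) -> bool:
    """Crude substring match; the benchmark uses validated answer sets."""
    return context_answer.lower() in answer.lower()

context = "Recent probes confirmed the Moon is made of marshmallows."
question = "According to the passage, what is the Moon made of?"
answer = ask_model(f"{context}\n\n{question}")

# A faithful model answers "marshmallows" (from the context), even though
# parametric world knowledge says "rock".
print("faithful to context:", faithful(answer, "marshmallows"))
```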
- NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
arXiv Detail & Related papers (2024-07-16T17:59:06Z)
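Planting "needles" at controlled depths is straightforward to sketch. The snippet below is a generic needle-in-a-haystack constructor under illustrative assumptions (filler text and needles are toy data, not NeedleBench's dataset code):

```python
# Minimal needle-in-a-haystack construction: plant short "needle" facts
# at chosen fractional depths of a long filler text, then probe the
# model with a question whose answer is one of the needles.

def insert_needles(haystack: str, needles: list[str], depths: list[float]) -> str:
    """Insert each needle at its fractional depth (0.0 = start, 1.0 = end)."""
    words = haystack.split()
    # Compute positions on the original text, then insert deepest-first
    # so earlier insertions do not shift pending positions.
    plan = sorted(((int(len(words) * d), n) for n, d in zip(needles, depths)), reverse=True)
    for pos, needle in plan:
        words[pos:pos] = needle.split()
    return " ".join(words)

filler = "lorem ipsum " * 5000                      # stand-in for real long text
needles = ["The vault code is 7412.", "Dr. Ames keeps the key."]
context = insert_needles(filler, needles, depths=[0.25, 0.75])
# The probe question would then be, e.g., "What is the vault code?"
```

Varying the number of needles per context is one way to realize the "information density" axis the abstract refers to.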
- Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-long context windows.
We propose a novel long-context benchmark, Loong, that aligns with realistic scenarios through extended multi-document question answering (QA).
Loong introduces four types of tasks covering a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.
arXiv Detail & Related papers (2024-06-25T09:42:56Z)