How Does Response Length Affect Long-Form Factuality
- URL: http://arxiv.org/abs/2505.23295v1
- Date: Thu, 29 May 2025 09:47:56 GMT
- Title: How Does Response Length Affect Long-Form Factuality
- Authors: James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
- Abstract summary: Despite growing attention to factuality, the effect of response length on factuality remains underexplored. We introduce an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations. Using this framework, we find that longer responses exhibit lower factual precision, confirming the presence of length bias.
- Score: 44.91589620660189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
Related papers
- Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness [6.250095470690937]
Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. We evaluate both factual knowledge and the impact of evidence placement across varying context lengths.
arXiv Detail & Related papers (2026-02-15T08:15:13Z)
- Not All Needles Are Found: How Fact Distribution and Don't Make It Up Prompts Shape Literal Extraction, Logical Inference, and Hallucination Risks in Long-Context LLMs [0.0]
Large language models (LLMs) increasingly support very long input contexts. It remains unclear how reliably they extract and infer information at scale. We study how fact placement, corpus-level fact distributions, and Don't Make It Up prompts influence model behavior.
arXiv Detail & Related papers (2026-01-05T11:30:56Z)
- Trace Length is a Simple Uncertainty Signal in Reasoning Models [18.432200654999082]
We show that reasoning trace length is a useful confidence estimator in large reasoning models. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy. We identify high-entropy or "forking" tokens as playing a key role in the mechanism.
arXiv Detail & Related papers (2025-10-12T02:04:06Z)
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z)
- Long-term Causal Inference via Modeling Sequential Latent Confounding [49.64731441006396]
Long-term causal inference is an important but challenging problem across various scientific domains. We propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes.
arXiv Detail & Related papers (2025-02-26T09:56:56Z)
- FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models [59.171510592986735]
We propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches.
arXiv Detail & Related papers (2025-02-25T19:01:48Z)
- LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data [19.79929012055293]
LongFaith is a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains.
arXiv Detail & Related papers (2025-02-18T06:40:23Z)
- Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies [66.30619782227173]
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing. We identify several features of LLM responses that shape users' reliance. We find that explanations increase reliance on both correct and incorrect responses. We observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies.
arXiv Detail & Related papers (2025-02-12T16:35:41Z)
- Do Large Language Models Show Biases in Causal Learning? [3.0264418764647605]
Causal learning is the cognitive process of developing the capability of making causal inferences based on available information. This research investigates whether large language models (LLMs) develop causal illusions.
arXiv Detail & Related papers (2024-12-13T19:03:48Z)
- Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown [68.33486915047014]
We investigate the factuality of long-form text generation across various large language models (LLMs). Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
arXiv Detail & Related papers (2024-11-24T22:06:26Z)
- Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
- Explaining Length Bias in LLM-Based Preference Evaluations [51.07275977870145]
We decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. We show that response length impacts evaluations by influencing information mass. We propose AdapAlpaca, a simple yet effective adjustment to win rate measurement.
arXiv Detail & Related papers (2024-07-01T08:37:41Z)
- Know When To Stop: A Study of Semantic Drift in Text Generation [9.76171773410722]
We show that modern LLMs tend to generate correct facts first, then "drift away" and generate incorrect facts later.
This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation.
arXiv Detail & Related papers (2024-04-08T11:25:30Z)
- When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour [0.8133739801185272]
We show that Large Language Models (LLMs) show sycophantic tendencies when responding to queries involving subjective opinions and statements.
LLMs at various scales seem not to follow the users' hints by demonstrating confidence in delivering the correct answers.
arXiv Detail & Related papers (2023-11-15T22:18:33Z)