How Does Response Length Affect Long-Form Factuality
- URL: http://arxiv.org/abs/2505.23295v1
- Date: Thu, 29 May 2025 09:47:56 GMT
- Title: How Does Response Length Affect Long-Form Factuality
- Authors: James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng,
- Abstract summary: Despite growing attention to factuality, the effect of response length on factuality remains underexplored. We introduce an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations. Using this framework, we find that longer responses exhibit lower factual precision, confirming the presence of length bias.
- Score: 44.91589620660189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
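The paper's framework is not reproduced here, but its central metric is simple to state: factual precision is the fraction of a response's verifiable atomic claims that are judged supported. Below is a minimal, self-contained sketch of that computation, plus a cumulative variant that illustrates how precision can decline as a response grows longer. The claim labels, names, and helpers are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (illustrative, not the paper's framework): factual precision
# as supported atomic claims divided by all verifiable claims in a response.
# Claim extraction and verification are stubbed with hand-labeled data.

from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    supported: bool  # verdict from a human annotator or an automatic verifier


def factual_precision(claims: List[Claim]) -> float:
    """Fraction of verifiable claims that are judged supported."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)


def cumulative_precision(claims: List[Claim]) -> List[float]:
    """Precision of the first k claims for k = 1..n, in generation order,
    to show how precision can drop as the response grows."""
    return [factual_precision(claims[:k]) for k in range(1, len(claims) + 1)]


if __name__ == "__main__":
    # Toy example where earlier claims are reliable and later ones drift.
    response_claims = [
        Claim("Marie Curie won the Nobel Prize in Physics in 1903.", True),
        Claim("She won a second Nobel Prize, in Chemistry, in 1911.", True),
        Claim("She was born in Warsaw.", True),
        Claim("She was the first female professor at Oxford.", False),
        Claim("She won a third Nobel Prize in 1921.", False),
    ]
    print(f"overall precision: {factual_precision(response_claims):.2f}")
    print("cumulative:", [round(p, 2) for p in cumulative_precision(response_claims)])
```

Under this metric, the length bias reported in the abstract corresponds to the later entries of the cumulative list trending downward.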
Related papers
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z) - Long-term Causal Inference via Modeling Sequential Latent Confounding [49.64731441006396]
Long-term causal inference is an important but challenging problem across various scientific domains. We propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes.
arXiv Detail & Related papers (2025-02-26T09:56:56Z) - FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models [59.171510592986735]
We propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Our experiments on labeled and unlabeled benchmark datasets clearly demonstrate that FactReasoner improves considerably over state-of-the-art prompt-based approaches.
arXiv Detail & Related papers (2025-02-25T19:01:48Z) - LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data [19.79929012055293]
LongFaith is a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains.
arXiv Detail & Related papers (2025-02-18T06:40:23Z) - Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies [66.30619782227173]
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing. We identify several features of LLM responses that shape users' reliance. We find that explanations increase reliance on both correct and incorrect responses. We observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies.
arXiv Detail & Related papers (2025-02-12T16:35:41Z) - Do Large Language Models Show Biases in Causal Learning? [3.0264418764647605]
Causal learning is the cognitive process of developing the capability of making causal inferences based on available information. This research investigates whether large language models (LLMs) develop causal illusions.
arXiv Detail & Related papers (2024-12-13T19:03:48Z) - Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
arXiv Detail & Related papers (2024-10-31T12:48:58Z) - Explaining Length Bias in LLM-Based Preference Evaluations [51.07275977870145]
We decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass. We show that response length impacts evaluations by influencing information mass. We propose AdapAlpaca, a simple yet effective adjustment to win rate measurement.
arXiv Detail & Related papers (2024-07-01T08:37:41Z) - Know When To Stop: A Study of Semantic Drift in Text Generation [9.76171773410722]
We show that modern LLMs tend to generate correct facts first, then "drift away" and generate incorrect facts later.
This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation (see the sketch after this list).
arXiv Detail & Related papers (2024-04-08T11:25:30Z) - When Large Language Models contradict humans? Large Language Models' Sycophantic Behaviour [0.8133739801185272]
We show that large language models (LLMs) exhibit sycophantic tendencies when responding to queries involving subjective opinions and statements.
In contrast, on queries with an objective answer, LLMs at various scales seem not to follow the users' hints, instead demonstrating confidence in delivering the correct answers.
arXiv Detail & Related papers (2023-11-15T22:18:33Z)
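Following up on the "know when to stop" observation above, here is a small, hypothetical sketch of truncating a response once per-sentence factuality scores start to drop. The `score_fact` verifier is an assumed stand-in for any external fact-checking signal; nothing here comes from the cited paper's implementation.

```python
# Hypothetical sketch of "knowing when to stop": keep sentences while an
# external factuality score stays above a threshold, and truncate once the
# response starts to drift. `score_fact` is an assumed verifier returning the
# estimated probability that a sentence is factually correct.

from typing import Callable, List


def truncate_on_drift(
    sentences: List[str],
    score_fact: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Return the longest prefix of `sentences` whose sentences all score
    at or above `threshold`; generation would stop at the first drop."""
    kept: List[str] = []
    for sent in sentences:
        if score_fact(sent) < threshold:
            break
        kept.append(sent)
    return kept


if __name__ == "__main__":
    # Toy scorer: pretend verifier confidence decays as the response lengthens.
    draft = ["Fact A.", "Fact B.", "Fact C.", "Shaky claim D.", "Wrong claim E."]
    fake_scores = {s: 0.9 - 0.2 * i for i, s in enumerate(draft)}
    print(truncate_on_drift(draft, lambda s: fake_scores[s], threshold=0.5))
    # -> ['Fact A.', 'Fact B.', 'Fact C.']
```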