Related papers: Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

URL: http://arxiv.org/abs/2411.15993v2
Date: Thu, 25 Sep 2025 02:28:07 GMT
Title: Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown
Authors: Lifu Tu, Rui Meng, Shafiq Joty, Yingbo Zhou, Semih Yavuz,
Abstract summary: We investigate the factuality of long-form text generation across various large language models (LLMs)<n>Our analysis reveals that factuality tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
Score: 68.33486915047014
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigates the factuality of long-form text generation across various large language models (LLMs), including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore the effectiveness of different evaluation settings to assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). Empirically, we observe a positive correlation between higher Self-Known scores and improved factuality, whereas higher Self-Unknown scores are associated with reduced factuality. Interestingly, the number of unsupported claims can increase even without significant changes in a model's self-judgment scores (Self-Known and Self-Unknown), likely as a byproduct of long-form text generation. We also derive a mathematical framework linking Self-Known and Self-Unknown scores to factuality: $\textrm{Factuality}=\frac{1-\textrm{Self-Unknown}}{2-\textrm{Self-Unknown}-\textrm{Self-Known}}$, which aligns with our empirical observations. Additional Retrieval-Augmented Generation (RAG) experiments further highlight the limitations of current LLMs in long-form generation and underscore the need for continued research to improve factuality in long-form text.

Related papers

Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling [56.26834106704781]
Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs)<n>We provide evidence supporting the presence of LLMs' internal compass that dictate the correctness of factual recall at the time of generation.<n>Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers.
arXiv Detail & Related papers (2025-05-27T16:24:02Z)
Misinforming LLMs: vulnerabilities, challenges and opportunities [4.54019093815234]
Large Language Models (LLMs) have made significant advances in natural language processing, but their underlying mechanisms are often misunderstood. This paper argues that current LLM architectures are inherently untrustworthy due to their reliance on correlations of sequential patterns of word embedding vectors. Research into combining generative transformer-based models with fact bases and logic programming languages may lead to the development of trustworthy LLMs.
arXiv Detail & Related papers (2024-08-02T10:35:49Z)
To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity [27.10502683001428]
This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. Experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.
arXiv Detail & Related papers (2024-07-24T09:48:48Z)
Large Language Models have Intrinsic Self-Correction Ability [16.831123666582755]
Large language models suffer from hallucinations that will cause performance degradation. One promising solution to improve the LLMs' performance is to ask LLMs to revise their answer after generation. In intrinsic self-correction is considered a promising direction because it does not utilize external knowledge.
arXiv Detail & Related papers (2024-06-21T22:29:40Z)
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrade on others. We formally define LLM's self-bias - the tendency to favor its own generation. We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge. We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality. We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
What Large Language Models Know and What People Think They Know [13.939511057660013]
Large language models (LLMs) are increasingly integrated into decision-making processes. To earn human trust, LLMs must be well calibrated so that they can accurately assess and communicate the likelihood of their predictions being correct. Here we explore the calibration gap, which refers to the difference between human confidence in LLM-generated answers and the models' actual confidence, and the discrimination gap, which reflects how well humans and models can distinguish between correct and incorrect answers.
arXiv Detail & Related papers (2024-01-24T22:21:04Z)
The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust. It asks necessary questions to decide when an LLM should refine its output. It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities. This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z)
Language Models Hallucinate, but May Excel at Fact Verification [89.0833981569957]
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs. Even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)
Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination" We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities. Concerns persist regarding the accuracy and appropriateness of their generated content. A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)
FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm. We collect responses generated from large language models and annotate factuality labels in a fine-grained manner. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.