UNCLE: Uncertainty Expressions in Long-Form Generation
- URL: http://arxiv.org/abs/2505.16922v1
- Date: Thu, 22 May 2025 17:16:08 GMT
- Title: UNCLE: Uncertainty Expressions in Long-Form Generation
- Authors: Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, Deqing Yang
- Abstract summary: Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. We introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers.
- Score: 48.7696074873262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.
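As a concrete, purely illustrative reading of "selectively expressing uncertainty", the sketch below scores hedging against gold knowledge labels: an ideal model hedges exactly the claims it does not know and plainly asserts the ones it does. The `Claim` structure and the metric names are hypothetical, not the paper's actual metrics.

```python
# Hypothetical sketch of a selective-uncertainty metric in the spirit of
# UNCLE. Names and definitions are illustrative, not the paper's own.
from dataclasses import dataclass

@dataclass
class Claim:
    is_known: bool   # gold label: does the model actually know this fact?
    is_hedged: bool  # did the generation express uncertainty about it?

def selective_uncertainty_scores(claims):
    """Hedging precision/recall: reward hedging unknown facts,
    penalize hedging known ones or asserting unknown ones."""
    hedged = [c for c in claims if c.is_hedged]
    unknown = [c for c in claims if not c.is_known]
    true_hedges = sum(1 for c in hedged if not c.is_known)
    precision = true_hedges / len(hedged) if hedged else 0.0
    recall = true_hedges / len(unknown) if unknown else 1.0
    return {"hedge_precision": precision, "hedge_recall": recall}

claims = [Claim(is_known=True, is_hedged=False),   # correctly asserted
          Claim(is_known=False, is_hedged=True),   # correctly hedged
          Claim(is_known=False, is_hedged=False)]  # missed hedge
print(selective_uncertainty_scores(claims))
# {'hedge_precision': 1.0, 'hedge_recall': 0.5}
```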
Related papers
- UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models [34.52549605613087]
We show that UNCERTAINTY-LINE consistently improves uncertainty estimates, even over nominally length-normalized UQ methods. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures.
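A minimal sketch of post-hoc length correction in this spirit: regress raw uncertainty scores on output length and keep the residual as a length-invariant estimate. The simple least-squares fit below is an assumption for illustration; the paper's actual estimator may differ.

```python
# Illustrative length debiasing: remove the length-predicted component
# from raw uncertainty scores (a sketch, not the paper's exact method).
def length_corrected(uncertainties, lengths):
    """Fit u ~ a*length + b by least squares and return residuals."""
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_u = sum(uncertainties) / n
    cov = sum((l - mean_l) * (u - mean_u)
              for l, u in zip(lengths, uncertainties))
    var = sum((l - mean_l) ** 2 for l in lengths)
    slope = cov / var if var else 0.0
    intercept = mean_u - slope * mean_l
    return [u - (slope * l + intercept)
            for u, l in zip(uncertainties, lengths)]

# Raw scores that grow with length; residuals strip out the trend.
print(length_corrected([0.2, 0.5, 0.9], [10, 50, 100]))
```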
arXiv Detail & Related papers (2025-05-25T09:30:43Z)
- Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning [80.27561080938747]
We propose a systematic framework, CANOE, to improve the faithfulness of large language models (LLMs) in both short-form and long-form generation tasks without human annotations. We also propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs.
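The summary does not spell out the three tailored rewards; as a generic illustration of a rule-based reward derived from short-form QA data, a simple exact-match rule might look as follows (an assumption, not CANOE's actual reward):

```python
# A stand-in rule-based reward for short-form QA; CANOE's three actual
# rewards are not detailed in the summary above.
import re

def exact_match_reward(generation: str, gold_answer: str) -> float:
    """1.0 if the normalized gold answer appears in the generation."""
    norm = lambda s: re.sub(r"\s+", " ", s.lower()).strip()
    return 1.0 if norm(gold_answer) in norm(generation) else 0.0

print(exact_match_reward("The capital is  Paris.", "paris"))  # 1.0
print(exact_match_reward("I am not sure.", "paris"))          # 0.0
```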
arXiv Detail & Related papers (2025-05-22T10:10:07Z)
- Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance [33.16322104912836]
Large language models' (LLMs) reasoning ability is largely due to chain-of-thought (CoT) approaches. LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. Human beings, however, are naturally cognitive misers and will often prompt language models for rather short responses.
arXiv Detail & Related papers (2025-04-13T14:12:14Z)
- LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty (LoGU).
We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment.
Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims.
Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
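A toy sketch of the divide-and-conquer idea: split a response into atomic claims, score each, and hedge the low-confidence ones. The confidence scores, threshold, and hedging template below are placeholder assumptions; in practice the decomposition and scoring are model-based.

```python
# Illustrative claim-level uncertainty refinement (not LoGU's exact
# pipeline): hedge atomic claims whose confidence falls below a cutoff.
CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff

def refine(claims_with_confidence):
    """Keep confident atomic claims verbatim; hedge the rest."""
    parts = []
    for claim, confidence in claims_with_confidence:
        if confidence < CONFIDENCE_THRESHOLD:
            parts.append(f"I am not sure, but {claim[0].lower()}{claim[1:]}")
        else:
            parts.append(claim)
    return " ".join(parts)

print(refine([("Paris is the capital of France.", 0.95),
              ("The city was founded in 250 BC.", 0.30)]))
# Paris is the capital of France. I am not sure, but the city was
# founded in 250 BC.
```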
arXiv Detail & Related papers (2024-10-18T09:15:35Z)
- Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models [96.43562963756975]
We train a regression model whose target variable is the gap between the conditional and the unconditional generation confidence.
We use this learned conditional dependency model to modulate the uncertainty of the current generation step based on the uncertainty of the previous step.
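A rough sketch of that modulation step, assuming the learned model outputs a per-step dependency gap in [0, 1]; the blending rule below is an illustrative choice, not the paper's exact formula.

```python
# Illustrative uncertainty modulation: the stronger the predicted
# conditional dependency, the more the previous step's uncertainty
# carries over into the current step.
def modulated_uncertainty(u_curr, u_prev, predicted_gap):
    """Blend the current step's uncertainty with the previous step's,
    weighted by the predicted conditional-dependency gap."""
    w = max(0.0, min(1.0, predicted_gap))  # clamp to [0, 1]
    return (1 - w) * u_curr + w * max(u_curr, u_prev)

# A step that looks confident (0.2) after an uncertain one (0.8) is
# pulled upward when dependence on context is predicted to be strong.
print(modulated_uncertainty(0.2, 0.8, 0.7))  # 0.62
```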
arXiv Detail & Related papers (2024-08-20T09:42:26Z)
- LUQ: Long-text Uncertainty Quantification for LLMs [29.987010627250527]
Large Language Models (LLMs) are prone to generating nonfactual content.
Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generation.
We propose LUQ-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty.
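A minimal sketch of ensemble-by-uncertainty selection: score each model's response with a long-text UQ function and keep the least uncertain one. The hedge-counting scorer below is a crude placeholder, not LUQ's actual scoring.

```python
# Illustrative ensemble selection: keep the candidate with the lowest
# uncertainty score (the scorer here is a toy stand-in for LUQ).
def luq_ensemble_select(responses, uq_score):
    """Return the candidate response with the lowest uncertainty."""
    return min(responses, key=uq_score)

candidates = [
    "Marie Curie won two Nobel Prizes, in physics and chemistry.",
    "Marie Curie won a Nobel Prize, possibly in literature.",
]
# Placeholder scorer: count hedging words as a crude uncertainty signal.
hedges = ("possibly", "maybe", "perhaps")
uq = lambda r: sum(r.lower().count(h) for h in hedges)
print(luq_ensemble_select(candidates, uq))  # picks the unhedged response
```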
arXiv Detail & Related papers (2024-03-29T16:49:24Z)
- Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination."
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
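The enhance-or-reject decision can be pictured as a simple threshold rule, sketched below with an illustrative cutoff; the framework's actual mechanism is richer than this.

```python
# Toy enhance-or-reject rule: answer when confident enough, otherwise
# abstain. The threshold and abstention message are assumptions.
ABSTAIN = "I don't know."

def answer_or_abstain(answer, uncertainty, tau=0.5):
    """Emit the answer when uncertainty is low; otherwise reject it."""
    return answer if uncertainty <= tau else ABSTAIN

print(answer_or_abstain("The Eiffel Tower is in Paris.", 0.1))  # answered
print(answer_or_abstain("The Eiffel Tower is in Rome.", 0.9))   # abstains
```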
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal a language model's comprehensive grasp of language through its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
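A toy version of such a probe: apply small, everyday perturbations to a prompt and measure the score drop under a quality function standing in for the pre-trained reward model. The perturbation, scorer, and vocabulary below are all illustrative assumptions.

```python
# Illustrative word-level robustness probe: perturb the input slightly
# and compare scores before and after (scorer is a reward-model stand-in).
import random

VOCAB = {"what", "is", "the", "capital", "of", "france?"}

def swap_adjacent_chars(text, rng):
    """Apply one small, everyday typo: swap two neighboring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(prompt, score, trials=5, seed=0):
    """Mean score drop under perturbation; a large gap means fragility."""
    rng = random.Random(seed)
    base = score(prompt)
    perturbed = [score(swap_adjacent_chars(prompt, rng))
                 for _ in range(trials)]
    return base - sum(perturbed) / trials

# Placeholder scorer: count recognizable words.
score = lambda s: sum(w.lower() in VOCAB for w in s.split())
print(robustness_gap("What is the capital of France?", score))
```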
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.