Linguistic Calibration of Long-Form Generations
- URL: http://arxiv.org/abs/2404.00474v2
- Date: Tue, 4 Jun 2024 22:39:58 GMT
- Title: Linguistic Calibration of Long-Form Generations
- Authors: Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto
- Abstract summary: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate.
This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements.
We define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions.
- Score: 57.836339732160916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.
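The reinforcement-learning step optimizes an objective defined over the predictions a reader makes after consuming the generation, rather than over the text itself. The sketch below is a minimal illustration of that idea, assuming the reward is a proper scoring rule (here, the log score) applied to a simulated reader's forecast on a related question; the function names and numbers are hypothetical and this is not the authors' implementation.

```python
import math

def log_score_reward(reader_forecast: dict[str, float], true_answer: str,
                     eps: float = 1e-6) -> float:
    """Reward a generation by the log probability a simulated reader assigns
    to the true answer of a related question after reading the generation.

    reader_forecast: mapping from candidate answer to forecast probability.
    The log score is a proper scoring rule, so in expectation it is maximized
    when the reader's reported probabilities are calibrated.
    """
    p = reader_forecast.get(true_answer, 0.0)
    return math.log(max(p, eps))

# Hypothetical example: the generation states "I estimate a 30% chance the
# award was won in 1950 and a 70% chance it was won in 1952."
forecast = {"1950": 0.3, "1952": 0.7}
print(log_score_reward(forecast, "1952"))  # log(0.7) ~= -0.357
```

Because the log score is proper, a policy that maximizes this expected reward is pushed toward generations whose stated confidence matches the empirical frequency of correctness, which is the calibration property defined in the abstract.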
Related papers
- Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown [55.91887554462312]
We investigate the factuality of long-form text generation across various large language models (LLMs).
Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
We find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality.
arXiv Detail & Related papers (2024-11-24T22:06:26Z)
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- Finetuning Language Models to Emit Linguistic Expressions of Uncertainty [5.591074369497796]
Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks.
LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing.
In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty.
arXiv Detail & Related papers (2024-09-18T17:52:53Z)
- Multi-group Uncertainty Quantification for Long-form Text Generation [29.65035492536852]
We study the problem of uncertainty quantification of factual correctness in long-form natural language generation.
We invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts.
arXiv Detail & Related papers (2024-07-25T02:59:52Z)
- Predict the Next Word: Humans exhibit uncertainty in this task and language models _____ [7.581259361859477]
Language models (LMs) are trained to assign probability to human-generated text.
We exploit this fact and evaluate the LM's ability to reproduce variability that humans exhibit in the 'next word prediction' task.
We assess GPT-2, BLOOM, and ChatGPT and find that they exhibit fairly low calibration to human uncertainty.
arXiv Detail & Related papers (2024-02-27T14:11:32Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
The capabilities of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Bridging the Gap Between Training and Inference of Bayesian Controllable Language Models [58.990214815032495]
Large-scale pre-trained language models have achieved great success on natural language generation tasks.
Bayesian controllable language models (BCLMs) have been shown to be efficient in controllable language generation.
We propose a "Gemini Discriminator" for controllable language generation which alleviates the training-inference mismatch at a small computational cost.
arXiv Detail & Related papers (2022-06-11T12:52:32Z)
- Factuality Enhanced Language Models for Open-Ended Text Generation [60.27166549575472]
We design the FactualityPrompts test set and metrics to measure the factuality of LM generations.
We find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions.
We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion.
arXiv Detail & Related papers (2022-06-09T17:16:43Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness; a minimal calibration-error sketch follows this list.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
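Several of the entries above (and the main paper) evaluate calibration by comparing stated or model confidence against empirical accuracy. For reference, here is a minimal sketch of the standard expected calibration error (ECE) computation; the function name, binning scheme, and toy numbers are illustrative assumptions and are not drawn from any of the listed papers.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin predictions by confidence and average the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example with illustrative numbers only:
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))  # 0.3
```

A perfectly calibrated predictor has an ECE of zero; the same binning view underlies the reliability diagrams commonly used to report calibration of question-answering models.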