Multi-group Uncertainty Quantification for Long-form Text Generation
- URL: http://arxiv.org/abs/2407.21057v1
- Date: Thu, 25 Jul 2024 02:59:52 GMT
- Title: Multi-group Uncertainty Quantification for Long-form Text Generation
- Authors: Terrance Liu, Zhiwei Steven Wu
- Abstract summary: We study the problem of uncertainty quantification of factual correctness in long-form natural language generation.
We invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts.
- Score: 29.65035492536852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models are rapidly moving towards consumer-facing applications, they are often still prone to factual errors and hallucinations. In order to reduce the potential harms that may come from these errors, it is important for users to know to what extent they can trust an LLM when it makes a factual claim. To this end, we study the problem of uncertainty quantification of factual correctness in long-form natural language generation. Given some output from a large language model, we study both uncertainty at the level of individual claims contained within the output (via calibration) and uncertainty across the entire output itself (via conformal prediction). Moreover, we invoke multicalibration and multivalid conformal prediction to ensure that such uncertainty guarantees are valid both marginally and across distinct groups of prompts. Using the task of biography generation, we demonstrate empirically that having access to and making use of additional group attributes for each prompt improves both overall and group-wise performance. As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored previously in the context of long-form text generation, we consider these empirical results to form a benchmark for this setting.
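To make the two ingredients in the abstract more concrete, below is a minimal NumPy sketch, not the authors' implementation: it (1) fits a simple per-group histogram-binning calibrator for claim-level confidence (a crude stand-in for multicalibration) and (2) computes a split-conformal threshold for filtering claims so that, under exchangeability, all retained claims in a new output are correct with probability at least 1 - alpha (a filtering-style stand-in for the paper's multivalid conformal procedure). All scores, correctness labels, group attributes, and the claims-per-output grouping are synthetic placeholders.

```python
# Minimal sketch of (a) per-group claim-level calibration and (b) a split-conformal
# claim-filtering threshold. Synthetic data only; NOT the paper's actual pipeline.
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic calibration data: one row per claim --------------------------
# score  : model confidence that the claim is factually correct (in [0, 1])
# correct: 1 if the claim was judged correct, else 0
# group  : a prompt attribute (e.g., an attribute of the biography subject)
n = 4000
group = rng.integers(0, 3, size=n)                      # three hypothetical prompt groups
score = np.clip(rng.beta(4, 2, size=n) - 0.1 * group, 0, 1)
correct = rng.binomial(1, np.clip(score + 0.05 * (group == 2), 0, 1))

# --- (1) Group-wise histogram-binning calibration ----------------------------
def binned_calibration_map(scores, labels, n_bins=10):
    """Map raw confidence scores to empirical claim accuracy within score bins."""
    edges = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    bin_acc = np.array([labels[bin_ids == b].mean() if np.any(bin_ids == b) else np.nan
                        for b in range(n_bins)])
    def apply(s):
        b = np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
        return bin_acc[b]
    return apply

# Marginal calibrator vs. one calibrator per group (the "multi-group" idea in spirit).
marginal_cal = binned_calibration_map(score, correct)
group_cals = {int(g): binned_calibration_map(score[group == g], correct[group == g])
              for g in np.unique(group)}

# --- (2) Split-conformal threshold for claim filtering ------------------------
# Nonconformity score per calibration *output*: here every 10 claims stand in for one
# output, and we take the highest confidence among its incorrect claims. Keeping only
# test claims scored above the (1 - alpha) conformal quantile of these values gives,
# under exchangeability, a marginal guarantee that all retained claims are correct
# with probability at least 1 - alpha.
alpha = 0.1
outputs = np.array_split(np.arange(n), n // 10)
r = np.array([score[idx][correct[idx] == 0].max() if np.any(correct[idx] == 0) else 0.0
              for idx in outputs])
k = int(np.ceil((len(r) + 1) * (1 - alpha)))
tau = np.sort(r)[min(k, len(r)) - 1]

print("conformal keep-threshold tau =", round(float(tau), 3))
print("marginally calibrated prob. for a claim scored 0.8:",
      round(float(marginal_cal(0.8)), 3))
for g, cal in group_cals.items():
    print(f"group {g}: calibrated prob. for a claim scored 0.8 = {round(float(cal(0.8)), 3)}")
```

In the paper's setting, the scores would come from an LLM's confidence in claims extracted from generated biographies and the group attributes from prompt metadata; the multicalibration and multivalid conformal methods it studies target stronger group-wise guarantees than the simple per-group recipes sketched above.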
Related papers
- Conformal Prediction Adaptive to Unknown Subpopulation Shifts [11.046912341345294]
Conformal prediction is widely used to equip black-box machine learning models with uncertainty quantification enjoying formal coverage guarantees. In this work, we address subpopulation shifts, where the test environment exhibits an unknown and differing mixture of subpopulations compared to the calibration data. We propose new methods that provably adapt conformal prediction to such shifts, ensuring valid coverage without requiring explicit knowledge of subpopulation structure.
arXiv Detail & Related papers (2025-06-05T20:58:39Z) - On the Interconnections of Calibration, Quantification, and Classifier Accuracy Prediction under Dataset Shift [58.91436551466064]
This paper investigates the interconnections among three fundamental problems (calibration, quantification, and classifier accuracy prediction) under dataset shift conditions. We show that access to an oracle for any one of these tasks enables the resolution of the other two. We propose new methods for each problem based on direct adaptations of well-established methods borrowed from the other disciplines.
arXiv Detail & Related papers (2025-05-16T15:42:55Z) - Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown [55.91887554462312]
We investigate the factuality of long-form text generation across various large language models (LLMs).
Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
We find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality.
arXiv Detail & Related papers (2024-11-24T22:06:26Z) - PATH: A Discrete-sequence Dataset for Evaluating Online Unsupervised Anomaly Detection Approaches for Multivariate Time Series [0.01874930567916036]
Benchmarking anomaly detection approaches for multivariate time series is a challenging task due to a lack of high-quality datasets. We propose a solution: a diverse, extensive, and non-trivial dataset generated via state-of-the-art simulation tools. Our dataset represents a discrete-sequence problem, which remains unaddressed by previously proposed solutions in the literature.
arXiv Detail & Related papers (2024-11-21T09:03:12Z) - Epistemic Integrity in Large Language Models [11.173637560124828]
Large language models are increasingly relied upon as sources of information, but their propensity for false or misleading statements poses high risks for users and society.
In this paper, we confront the critical problem of miscalibration where a model's linguistic assertiveness fails to reflect its true internal certainty.
We introduce a new human misalignment evaluation and a novel method for measuring the linguistic assertiveness of Large Language Models.
arXiv Detail & Related papers (2024-11-10T17:10:13Z) - On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z) - Finetuning Language Models to Emit Linguistic Expressions of Uncertainty [5.591074369497796]
Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks.
LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing.
In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty.
arXiv Detail & Related papers (2024-09-18T17:52:53Z) - Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step. Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection.
arXiv Detail & Related papers (2024-08-24T14:14:32Z) - Editable Fairness: Fine-Grained Bias Mitigation in Language Models [52.66450426729818]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases.
FAST surpasses state-of-the-art baselines with superior debiasing performance.
This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z) - Linguistic Calibration of Long-Form Generations [57.836339732160916]
Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate.
This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements.
We define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions.
arXiv Detail & Related papers (2024-03-30T20:47:55Z) - Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification [116.77055746066375]
Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output.
We propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification.
arXiv Detail & Related papers (2024-03-07T17:44:17Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions (a minimal sketch of this ensembling idea appears after this list).
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Is this model reliable for everyone? Testing for strong calibration [4.893345190925178]
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup.
The task of auditing a model for strong calibration is well-known to be difficult due to the sheer number of potential subgroups.
Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal.
arXiv Detail & Related papers (2023-07-28T00:59:14Z) - Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs).
Standard conformal prediction produces prediction sets with rigorous, statistical guarantees.
We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
arXiv Detail & Related papers (2023-06-16T21:55:08Z) - Achieving Long-term Fairness in Submodular Maximization through Randomization [16.33001220320682]
It is important to implement fairness-aware algorithms when dealing with data items that may contain sensitive attributes like race or gender.
We investigate the problem of maximizing a monotone submodular function while meeting group fairness constraints.
arXiv Detail & Related papers (2023-04-10T16:39:19Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels can wrongly direct neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Evaluating Factuality in Generation with Dependency-level Entailment [57.5316011554622]
We propose a new formulation of entailment that decomposes it at the level of dependency arcs.
We show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods.
arXiv Detail & Related papers (2020-10-12T06:43:10Z)
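As referenced in the "Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling" entry above, here is a minimal sketch of the ensembling idea that entry describes. `generate_clarifications` and `predict_distribution` are hypothetical placeholders for LLM calls, and the entropy-based split into within-clarification uncertainty and ambiguity-driven disagreement is one standard decomposition that may differ from the paper's exact formulation.

```python
# Sketch of input clarification ensembling with placeholder LLM calls.
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def generate_clarifications(prompt, k=4):
    # Placeholder: in practice, ask an LLM to produce k disambiguated rewrites of `prompt`.
    return [f"{prompt} (clarification {i})" for i in range(k)]

def predict_distribution(prompt, labels):
    # Placeholder: in practice, query the LLM for a probability over candidate answers.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.dirichlet(np.ones(len(labels)))

def clarification_ensemble(prompt, labels):
    clarifications = generate_clarifications(prompt)
    preds = np.stack([predict_distribution(c, labels) for c in clarifications])  # (k, |labels|)
    mean_pred = preds.mean(axis=0)                        # ensembled prediction
    total = entropy(mean_pred)                            # total uncertainty
    within = float(np.mean([entropy(p) for p in preds]))  # uncertainty given a clarification
    ambiguity = total - within                            # disagreement across clarifications
    return mean_pred, total, within, ambiguity

labels = ["yes", "no", "unsure"]
mean_pred, total, within, ambiguity = clarification_ensemble("Is the claim supported?", labels)
print("ensembled prediction:", np.round(mean_pred, 3))
print(f"total={total:.3f}  within-clarification={within:.3f}  ambiguity={ambiguity:.3f}")
```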
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all listed content) and is not responsible for any consequences of its use.