Self-Consistency Boosts Calibration for Math Reasoning
- URL: http://arxiv.org/abs/2403.09849v1
- Date: Thu, 14 Mar 2024 20:17:10 GMT
- Title: Self-Consistency Boosts Calibration for Math Reasoning
- Authors: Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu
- Abstract summary: We design three off-the-shelf calibration methods based on self-consistency for math reasoning tasks.
Our methods bridge model confidence and accuracy better than existing methods based on p(True) or logit.
- Score: 69.82896431282927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. In evaluations on two popular benchmarks (GSM8K and MathQA) with strong open-source LLMs (Mistral and LLaMA2), our methods bridge model confidence and accuracy better than existing methods based on p(True) (Kadavath et al., 2022) or logit (Kadavath et al., 2022).
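The paper's methods build on self-consistency: sampling several reasoning paths and using agreement among the sampled answers as a confidence signal. The sketch below is a rough illustration of that idea, not the authors' exact formulation; `sample_answer` is a hypothetical stand-in for a temperature > 0 LLM call.

```python
from collections import Counter
from typing import Callable, List, Tuple

def self_consistency_confidence(
    sample_answer: Callable[[str], str],  # hypothetical stand-in for an LLM call
    question: str,
    n_samples: int = 20,
) -> Tuple[str, float]:
    """Sample several answers and use majority agreement as a confidence score.

    A sketch of the self-consistency idea (Wang et al., 2022): the fraction of
    samples that agree with the modal answer serves as the model's confidence
    in that answer."""
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    return majority_answer, votes / n_samples
```

A well-calibrated score of this kind means that answers returned with confidence 0.7 are correct roughly 70% of the time; the paper compares agreement-style scores against p(True) and logit baselines.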
Related papers
- Does Alignment Tuning Really Break LLMs' Internal Confidence? [5.893124686141782]
Large Language Models (LLMs) have shown remarkable progress, but their real-world application necessitates reliable calibration.
This study conducts a comprehensive analysis of calibration degradation of LLMs across four dimensions: models, calibration metrics, tasks, and confidence extraction methods.
arXiv Detail & Related papers (2024-08-31T05:12:36Z)
- Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation [18.815226646364476]
Existing calibration methods for large language models (LLMs) focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom" of multiple interacting LLMs.
We propose Collaborative Calibration, a post-hoc, training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process.
arXiv Detail & Related papers (2024-04-14T02:40:43Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
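The paper's own three measures may differ; as an assumption for illustration, three natural ways to score a sample of generations are majority agreement, one minus the normalized entropy of the answer distribution, and the margin between the two most frequent answers:

```python
import math
from collections import Counter
from typing import Dict, List

def consistency_scores(answers: List[str]) -> Dict[str, float]:
    """Turn a non-empty list of sampled answers into consistency-based scores."""
    n = len(answers)
    counts = Counter(answers).most_common()          # (answer, count), descending
    top = counts[0][1] / n                           # share of the modal answer
    second = counts[1][1] / n if len(counts) > 1 else 0.0
    probs = [c / n for _, c in counts]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(n) if n > 1 else 1.0      # at most n distinct answers
    return {
        "agreement": top,                            # higher = more consistent
        "entropy_conf": 1.0 - entropy / max_entropy, # 1 - normalized entropy
        "top_gap": top - second,                     # first-vs-second margin
    }
```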
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- Calibrating Long-form Generations from Large Language Models [34.72041258464477]
Large Language Models' (LLMs) confidence scores should align with the actual likelihood of their responses being correct.
Current confidence elicitation methods and calibration metrics rely on a binary true/false assessment of response correctness.
We introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores.
arXiv Detail & Related papers (2024-02-09T17:00:32Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Estimating Large Language Model Capabilities without Labeled Test Data [51.428562302037534]
Large Language Models (LLMs) have the impressive ability to perform in-context learning (ICL) from only a few examples.
We propose the task of ICL accuracy estimation, in which we predict the accuracy of an LLM when doing in-context learning on a new task.
arXiv Detail & Related papers (2023-05-24T06:55:09Z)
- A Close Look into the Calibration of Pre-trained Language Models [56.998539510508515]
Pre-trained language models (PLMs) may fail in giving reliable estimates of their predictive uncertainty.
We study the dynamic change in PLMs' calibration performance in training.
We extend two recently proposed learnable methods that directly collect data to train models to produce reasonable confidence estimates.
arXiv Detail & Related papers (2022-10-31T21:31:07Z)
- Modular Conformal Calibration [80.33410096908872]
We introduce a versatile class of algorithms for recalibration in regression.
This framework allows one to transform any regression model into a calibrated probabilistic model.
We conduct an empirical study of MCC (Modular Conformal Calibration) on 17 regression datasets; a minimal recalibration sketch follows this entry.
arXiv Detail & Related papers (2022-06-23T03:25:23Z)
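For context, one simple instance of regression recalibration (an assumption for illustration, not the paper's full MCC framework) maps the model's predicted CDF values through the empirical CDF of those values on a calibration set:

```python
import numpy as np

def fit_recalibrator(pred_cdf_at_truth: np.ndarray):
    """Fit a recalibration map from a calibration set.

    pred_cdf_at_truth[i] is the model's predicted CDF evaluated at the true
    label y_i. For a calibrated model these values are uniform on [0, 1];
    their empirical CDF gives a monotone correction map."""
    sorted_vals = np.sort(pred_cdf_at_truth)

    def recalibrate(p: np.ndarray) -> np.ndarray:
        # Map a raw predicted CDF value p to the fraction of calibration
        # points whose value falls at or below p (the empirical CDF).
        return np.searchsorted(sorted_vals, p, side="right") / len(sorted_vals)

    return recalibrate
```

Composing the model's predictive CDF with the returned map pushes its probability estimates toward uniformity on held-out data, which is what calibration in regression requires.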
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.