Program-Aided Reasoners (better) Know What They Know
- URL: http://arxiv.org/abs/2311.09553v1
- Date: Thu, 16 Nov 2023 04:17:49 GMT
- Title: Program-Aided Reasoners (better) Know What They Know
- Authors: Anubha Kabra, Sanketh Rangreji, Yash Mathur, Aman Madaan, Emmy Liu, Graham Neubig
- Abstract summary: We compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets.
Our results indicate that PAL leads to improved calibration in 75% of the instances.
- Score: 59.29201607431494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work shows that program-aided reasoning, in which large language models
(LLMs) are combined with programs written in programming languages such as
Python, can significantly improve accuracy on various reasoning tasks. However,
while accuracy is essential, it is also important for such reasoners to "know
what they know", which can be quantified through the calibration of the model.
In this paper, we compare the calibration of Program Aided Language Models
(PAL) and text-based Chain-of-thought (COT) prompting techniques over 5
datasets and 2 model types: LLaMA models and OpenAI models. Our results
indicate that PAL leads to improved calibration in 75% of the instances. Our
analysis shows that prompting styles that produce less diverse
generations also yield better-calibrated results, and thus we also experiment with
inducing lower generation diversity using temperature scaling and find that for
certain temperatures, PAL is not only more accurate but is also more calibrated
than COT. Overall, we demonstrate that, in the majority of cases, program-aided
reasoners better know what they know than text-based counterparts.
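The abstract does not say how calibration is measured. A common choice, assumed here purely for illustration, is expected calibration error (ECE), with each answer's confidence taken as the fraction of sampled generations that agree with the majority answer. The sketch below shows that combination on invented toy data; the function names, the 10-bin setting, and the sample answers are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def consistency_confidence(samples):
    """Return (majority answer, fraction of samples agreeing with it)."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data, invented for illustration: (sampled answers, gold answer).
questions = [
    (["8", "8", "8", "8", "8"], "8"),
    (["12", "12", "12", "7", "12"], "12"),
    (["3", "5", "3", "4", "3"], "5"),
    (["20", "21", "20", "20", "19"], "20"),
]

confidences, correct = [], []
for samples, gold in questions:
    answer, conf = consistency_confidence(samples)
    confidences.append(conf)
    correct.append(answer == gold)

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

Under this definition, a lower ECE means the agreement-based confidence tracks accuracy more closely. Less diverse samples push confidences toward 1.0, which lowers ECE only when those confident answers are also mostly correct.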
Related papers
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- On the Calibration of Multilingual Question Answering LLMs [57.296161186129545]
We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks.
We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings.
For decoder-only LLMs such as Llama2, we additionally find that in-context learning improves confidence calibration on multilingual data.
arXiv Detail & Related papers (2023-11-15T03:29:02Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are now widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses [14.77013588561901]
We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term.
For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs.
We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%.
arXiv Detail & Related papers (2023-10-30T00:30:34Z) - On the Calibration of Massively Multilingual Language Models [15.373725507698591]
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages.
We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially.
arXiv Detail & Related papers (2022-10-21T21:41:56Z)
- Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level [38.64433236359172]
We find that BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP.
We also find that finetuning noise increases with model size and that instance-level accuracy has momentum.
Our findings suggest that instance-level predictions provide a rich source of information.
arXiv Detail & Related papers (2021-05-13T01:10:51Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit".
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.