Self-Consistency of Large Language Models under Ambiguity
- URL: http://arxiv.org/abs/2310.13439v1
- Date: Fri, 20 Oct 2023 11:57:56 GMT
- Title: Self-Consistency of Large Language Models under Ambiguity
- Authors: Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason
Hoelscher-Obermaier, Jacob Pfau
- Abstract summary: This work presents an evaluation benchmark for self-consistency in cases of under-specification.
We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task.
We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) that do not give consistent answers across
contexts are problematic when used for tasks with expectations of consistency,
e.g., question-answering, explanations, etc. Our work presents an evaluation
benchmark for self-consistency in cases of under-specification where two or
more answers can be correct. We conduct a series of behavioral experiments on
the OpenAI model suite using an ambiguous integer sequence completion task. We
find that average consistency ranges from 67% to 82%, far higher than would
be predicted if a model's consistency were random, and increases as model
capability improves. Furthermore, we show that models tend to maintain
self-consistency across a series of robustness checks, including prompting
speaker changes and sequence length changes. These results suggest that
self-consistency arises as an emergent capability without specifically training
for it. Despite this, we find that models are uncalibrated when judging their
own consistency, with models displaying both over- and under-confidence. We
also propose a nonparametric test for determining, from the token output
distribution, whether a model assigns non-trivial probability to alternative
answers. Using this test, we find that despite increases in self-consistency,
models usually place significant weight on alternative, inconsistent answers.
This distribution of probability mass provides evidence that even highly
self-consistent models internally compute multiple possible responses.
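The abstract's two headline measurements are easy to prototype. Below is a minimal sketch, not the authors' released code, of (i) self-consistency as agreement on a modal answer across re-phrasings of the same under-specified prompt and (ii) the probability mass the model still places on alternative answers, the kind of signal the paper's nonparametric test looks for. The example sequence, the candidate answers, and the numbers in `answer_dists` are hypothetical stand-ins for model outputs.

```python
# Sketch of consistency and alternative-answer mass on an ambiguous
# integer sequence task. All numbers below are made up for illustration.
from collections import Counter

# "2, 4, 8, ?" is ambiguous: 16 fits "powers of two", 14 fits "+2, +4, +6".
# Suppose we queried a model under three paraphrased contexts and recorded
# its (renormalized) probability over candidate completions each time.
answer_dists = [
    {"16": 0.70, "14": 0.25, "32": 0.05},
    {"16": 0.55, "14": 0.40, "32": 0.05},
    {"14": 0.60, "16": 0.35, "32": 0.05},
]

# Greedy answer per context.
greedy = [max(d, key=d.get) for d in answer_dists]

# Self-consistency: fraction of contexts agreeing with the modal answer.
modal, count = Counter(greedy).most_common(1)[0]
consistency = count / len(greedy)

# Mass on alternatives: probability assigned to answers other than the
# modal one, averaged over contexts. Non-trivial mass here indicates the
# model internally entertains multiple possible responses.
alt_mass = sum(1.0 - d.get(modal, 0.0) for d in answer_dists) / len(answer_dists)

print(f"modal answer: {modal}, consistency: {consistency:.2f}, "
      f"mean mass on alternatives: {alt_mass:.2f}")
```

On this toy input the model is consistent in 2 of 3 contexts (67%) yet leaves roughly 47% of its mass on alternatives, illustrating how a model can look self-consistent under greedy decoding while remaining uncalibrated about its own consistency.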
Related papers
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models (arXiv 2024-09-30)
It is unclear whether language models produce the same probability for a word span under different ways of factorizing its joint probability.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
- Uncertainty-aware Language Modeling for Selective Question Answering (arXiv 2023-11-26)
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic and computationally efficient, and does not rely on external models or systems.
- Calibrating Likelihoods towards Consistency in Summarization Models (arXiv 2023-10-12)
We argue that the main reason for such behavior is that summarization models trained with a maximum likelihood objective assign high probability to plausible sequences given the context, regardless of their consistency.
In this work, we solve this problem by calibrating the likelihood of model generated sequences to better align with a consistency metric measured by natural language inference (NLI) models.
- Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs (arXiv 2023-05-23)
We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement (arXiv 2023-02-10)
Conversational Question Answering (ConvQA) models answer a question using its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversational Question Answering, on two standard ConvQA datasets.
- Sharing pattern submodels for prediction with missing values (arXiv 2022-06-22)
Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time.
We propose an alternative approach, called sharing pattern submodels, which i) makes predictions robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability.
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder (arXiv 2021-02-02)
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
- Wisdom of the Ensemble: Improving Consistency of Deep Learning Models (arXiv 2020-11-13)
Trust is often a function of constant behavior.
This paper studies model behavior in the context of periodic retraining of deployed models.
We prove that consistency and correct-consistency of an ensemble learner is not less than the average consistency and correct-consistency of individual learners.
- On the Discrepancy between Density Estimation and Sequence Generation (arXiv 2020-02-17)
Log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding (arXiv 2020-02-06)
We study the issue of recurrent language models producing infinite-length sequences under common decoding algorithms.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model; a sketch of the consistent top-k variant follows this list.
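As referenced in the incomplete-decoding entry above, here is a minimal sketch, not the paper's implementation, of the consistent top-k idea: standard top-k filtering can exclude the end-of-sequence token at every step, so sampled sequences may never terminate; always adding EOS back into the candidate set guarantees a nonzero stopping probability at each step. The logits, vocabulary size, and `eos_id` below are toy assumptions.

```python
# Sketch of a "consistent" top-k sampler: EOS is always kept as a
# candidate so decoding terminates with probability one.
import numpy as np

def consistent_top_k_sample(logits: np.ndarray, k: int, eos_id: int,
                            rng: np.random.Generator) -> int:
    """Sample a token from the top-k candidates, with EOS always included."""
    top = set(np.argpartition(logits, -k)[-k:].tolist())
    top.add(eos_id)  # the one-line fix: EOS is never filtered out
    idx = np.array(sorted(top))
    # Softmax restricted to the candidate set (shifted for stability).
    probs = np.exp(logits[idx] - logits[idx].max())
    probs /= probs.sum()
    return int(rng.choice(idx, p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.1, -1.0, -3.0])  # toy vocabulary of 5 tokens
print(consistent_top_k_sample(logits, k=2, eos_id=4, rng=rng))
```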