Self-Consistency of Large Language Models under Ambiguity
- URL: http://arxiv.org/abs/2310.13439v1
- Date: Fri, 20 Oct 2023 11:57:56 GMT
- Title: Self-Consistency of Large Language Models under Ambiguity
- Authors: Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason
Hoelscher-Obermaier, Jacob Pfau
- Abstract summary: This work presents an evaluation benchmark for self-consistency in cases of under-specification.
We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task.
We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency were random.
- Score: 4.141513298907867
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) that do not give consistent answers across
contexts are problematic when used for tasks with expectations of consistency,
e.g., question-answering, explanations, etc. Our work presents an evaluation
benchmark for self-consistency in cases of under-specification where two or
more answers can be correct. We conduct a series of behavioral experiments on
the OpenAI model suite using an ambiguous integer sequence completion task. We
find that average consistency ranges from 67% to 82%, far higher than would
be predicted if a model's consistency were random, and increases as model
capability improves. Furthermore, we show that models tend to maintain
self-consistency across a series of robustness checks, including changes to
the prompting speaker and to sequence length. These results suggest that
self-consistency arises as an emergent capability without the model being
specifically trained for it. Despite this, we find that models are uncalibrated when judging their
own consistency, with models displaying both over- and under-confidence. We
also propose a nonparametric test for determining, from the token output
distribution, whether a model assigns non-trivial probability to alternative
answers. Using this test, we find that despite increases in self-consistency,
models usually place significant weight on alternative, inconsistent answers.
This distribution of probability mass provides evidence that even highly
self-consistent models internally compute multiple possible responses.
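To make the evaluation concrete, below is a minimal sketch, under stated assumptions, of how a consistency rate on an ambiguous integer-sequence task and the probability mass left on alternative answers could be computed. It is not the paper's exact protocol or its nonparametric test; `paraphrased_prompts` and `get_answer_logprobs` are hypothetical helpers standing in for calls to an LLM API.

```python
"""Illustrative sketch only (not the paper's exact protocol or nonparametric test).
Assumes two hypothetical helpers:
  paraphrased_prompts(prefix)             -> iterable of prompt strings for the same sequence
  get_answer_logprobs(prompt, candidates) -> {candidate: log-probability}
"""
from itertools import combinations
import math

# Ambiguous prefix: 2, 4, 8 continues as 16 under "doubling"
# but as 14 under "gaps of +2, +4, +6".
PREFIX = [2, 4, 8]
CANDIDATES = [16, 14]


def consistency_rate(answers):
    """Fraction of answer pairs, across contexts, that agree with each other."""
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0


def alternative_mass(logprobs, chosen):
    """Probability mass placed on candidates other than the chosen answer."""
    probs = {c: math.exp(lp) for c, lp in logprobs.items()}
    total = sum(probs.values())
    return sum(p for c, p in probs.items() if c != chosen) / total


def evaluate(prefix, candidates, paraphrased_prompts, get_answer_logprobs):
    """Return (consistency rate, mean probability mass on alternative answers)."""
    answers, alt_masses = [], []
    for prompt in paraphrased_prompts(prefix):
        lp = get_answer_logprobs(prompt, candidates)
        chosen = max(lp, key=lp.get)  # the model's preferred completion
        answers.append(chosen)
        alt_masses.append(alternative_mass(lp, chosen))
    return consistency_rate(answers), sum(alt_masses) / len(alt_masses)
```

A high consistency rate combined with a sizable average alternative mass would mirror the abstract's finding that models answer consistently while still placing non-trivial weight on the alternative completion.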
Related papers
- Independence Tests for Language Models [47.0749292650885]
Given the weights of two models, can we test whether they were trained independently?
We consider two settings: constrained and unconstrained.
We propose a new test which matches hidden activations between two models, and which is robust to adversarial transformations and to changes in model architecture.
arXiv Detail & Related papers (2025-02-17T20:01:08Z)
- DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty.
We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction.
Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z)
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models assign the same value to the joint probability of a word span when it is computed in different ways.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Calibrating Likelihoods towards Consistency in Summarization Models [22.023863165579602]
We argue that the main reason for such inconsistent behavior is that summarization models trained with a maximum likelihood objective assign high probability to plausible sequences given the context, whether or not those sequences are consistent with the source.
In this work, we solve this problem by calibrating the likelihood of model generated sequences to better align with a consistency metric measured by natural language inference (NLI) models.
arXiv Detail & Related papers (2023-10-12T23:17:56Z)
- Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs [78.31625291513589]
We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps.
We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency and compositional consistency.
We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks; a minimal sketch of a compositional-consistency check appears after this list.
arXiv Detail & Related papers (2023-05-23T17:25:59Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph and the question-answer pairs that occurred earlier in the multi-turn conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based Realistic Conversational Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Sharing pattern submodels for prediction with missing values [12.981974894538668]
Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time.
We propose an alternative approach, called sharing pattern submodels, which i) makes predictions robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability.
arXiv Detail & Related papers (2022-06-22T15:09:40Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
- Wisdom of the Ensemble: Improving Consistency of Deep Learning Models [11.230300336108018]
Trust is often a function of constant behavior.
This paper studies model behavior in the context of periodic retraining of deployed models.
We prove that the consistency and correct-consistency of an ensemble learner are not less than the average consistency and correct-consistency of the individual learners.
arXiv Detail & Related papers (2020-11-13T07:47:01Z)
- On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
arXiv Detail & Related papers (2020-02-17T20:13:35Z)
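As referenced in the "Two Failures of Self-Consistency" entry above, the sketch below illustrates one way a compositional-consistency check could be run. It is illustrative only and does not reproduce that paper's tasks or prompts; `ask_model` is a hypothetical helper that returns the model's short textual answer.

```python
"""Minimal sketch of a compositional-consistency check (illustrative only; not
the listed paper's exact setup). ask_model(prompt) -> str is a hypothetical
helper that queries an LLM and returns a short textual answer."""

def compositional_consistency(ask_model, x):
    # Direct route: ask for the composed computation in one question.
    direct = ask_model(f"What is ({x} + 3) * 2? Answer with a number only.")
    # Composed route: ask for the sub-step, then feed the model's own answer back.
    step = ask_model(f"What is {x} + 3? Answer with a number only.")
    chained = ask_model(f"What is {step} * 2? Answer with a number only.")
    # A compositionally consistent model gives the same final answer either way.
    return direct.strip() == chained.strip()


def compositional_consistency_rate(ask_model, inputs):
    """Fraction of inputs on which the direct and chained answers agree."""
    results = [compositional_consistency(ask_model, x) for x in inputs]
    return sum(results) / len(results)
```

Hypothetical consistency could be probed analogously, by asking the model what it would answer to a sub-question and comparing that prediction with the answer it actually gives when asked directly.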