Enhancing Self-Consistency and Performance of Pre-Trained Language
Models through Natural Language Inference
- URL: http://arxiv.org/abs/2211.11875v1
- Date: Mon, 21 Nov 2022 21:58:30 GMT
- Title: Enhancing Self-Consistency and Performance of Pre-Trained Language
Models through Natural Language Inference
- Authors: Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth
Agarwal, Patrick Liu, Chelsea Finn, Christopher D. Manning
- Abstract summary: Large pre-trained language models often lack logical consistency across test inputs.
We propose a framework, ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models.
We show that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models.
- Score: 72.61732440246954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large pre-trained language models are powerful, their predictions often
lack logical consistency across test inputs. For example, a state-of-the-art
Macaw question-answering (QA) model answers 'Yes' to 'Is a sparrow a bird?' and
'Does a bird have feet?' but answers 'No' to 'Does a sparrow have feet?'. To
address this failure mode, we propose a framework, Consistency Correction
through Relation Detection, or ConCoRD, for boosting the consistency and
accuracy of pre-trained NLP models using pre-trained natural language inference
(NLI) models without fine-tuning or re-training. Given a batch of test inputs,
ConCoRD samples several candidate outputs for each input and instantiates a
factor graph that accounts for both the model's belief about the likelihood of
each answer choice in isolation and the NLI model's beliefs about pair-wise
answer choice compatibility. We show that a weighted MaxSAT solver can
efficiently compute high-quality answer choices under this factor graph,
improving over the raw model's predictions. Our experiments demonstrate that
ConCoRD consistently boosts accuracy and consistency of off-the-shelf
closed-book QA and VQA models using off-the-shelf NLI models, notably
increasing accuracy of LXMERT on ConVQA by 5% absolute. See
https://ericmitchell.ai/emnlp-2022-concord/ for code and data.
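To make the pipeline concrete, here is a minimal sketch of the kind of objective ConCoRD optimizes for the sparrow example. The candidate probabilities, NLI scores, and weighting below are made-up values; the paper solves this objective with a weighted MaxSAT solver, whereas the sketch simply enumerates joint assignments.
```python
import itertools
import math

# Toy inputs (made-up numbers): per-question candidate answers with the base
# QA model's probabilities, mirroring the sparrow example from the abstract.
candidates = {
    "Is a sparrow a bird?":      {"Yes": 0.90, "No": 0.10},
    "Does a bird have feet?":    {"Yes": 0.80, "No": 0.20},
    "Does a sparrow have feet?": {"Yes": 0.40, "No": 0.60},
}

# Hypothetical pairwise NLI judgements between answer statements:
# key = ((question, answer), (question, answer)), value = NLI probability.
entailments = {
    (("Is a sparrow a bird?", "Yes"), ("Does a sparrow have feet?", "Yes")): 0.95,
}
contradictions = {
    (("Is a sparrow a bird?", "Yes"), ("Does a sparrow have feet?", "No")): 0.90,
}

BETA = 1.0  # relative weight of NLI factors vs. the base model's beliefs


def log_score(assignment):
    """Log-linear score of one joint answer assignment (the factor-graph objective)."""
    s = sum(math.log(candidates[q][a]) for q, a in assignment.items())
    for ((q1, a1), (q2, a2)), p in entailments.items():
        if assignment[q1] == a1 and assignment[q2] == a2:
            s += BETA * math.log(p)           # reward satisfied entailments
    for ((q1, a1), (q2, a2)), p in contradictions.items():
        if assignment[q1] == a1 and assignment[q2] == a2:
            s += BETA * math.log(1.0 - p)     # penalize selected contradictions
    return s


# Enumerate all joint assignments; a weighted MaxSAT solver performs this
# search efficiently in the actual method.
questions = list(candidates)
best = max(
    (dict(zip(questions, combo))
     for combo in itertools.product(*(candidates[q] for q in questions))),
    key=log_score,
)
print(best)  # the contradiction factor flips "Does a sparrow have feet?" to "Yes"
```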
Related papers
- Self-Consistency of Large Language Models under Ambiguity [4.141513298907867]
This work presents an evaluation benchmark for self-consistency in cases of under-specification.
We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task.
We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency was random.
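A minimal sketch of how such a consistency rate can be computed, assuming consistency is measured as agreement with the modal answer across repeated samples; the prompt and samples below are placeholders, not the benchmark's actual protocol.
```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers that agree with the most frequent answer."""
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

# Placeholder samples for an ambiguous prompt such as "1, 2, 4, ?", which could
# legitimately continue as 8 (powers of two) or 7 (increments of 1, 2, 3).
samples = ["8", "8", "7", "8", "7", "8", "8", "8", "7", "8"]
print(self_consistency(samples))  # 0.7 -> this toy run is 70% self-consistent
```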
arXiv Detail & Related papers (2023-10-20T11:57:56Z)
- Learn What Is Possible, Then Choose What Is Best: Disentangling One-To-Many Relations in Language Through Text-based Games [3.615981646205045]
We present an approach to train language models that can emulate the desirable behaviours, but not the undesirable ones.
Using text-based games as a testbed, our approach, PASA, uses discrete latent variables to capture the range of different behaviours.
Results show up to 49% empirical improvement over the previous state-of-the-art model.
arXiv Detail & Related papers (2023-04-14T17:11:26Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph together with the question-answer pairs that occurred earlier in the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversational Question Answering, on two standard ConvQA datasets.
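A minimal sketch of this kind of history filtering, with hypothetical field names and thresholds rather than the paper's actual selection rule:
```python
def select_history(history, min_confidence=0.7, max_uncertainty=0.3):
    """Keep only past (question, answer) turns the model seems sure about.

    Each turn is a dict with hypothetical 'confidence' and 'uncertainty'
    scores that would come from the ConvQA model itself.
    """
    return [
        (turn["question"], turn["answer"])
        for turn in history
        if turn["confidence"] >= min_confidence
        and turn["uncertainty"] <= max_uncertainty
    ]

history = [
    {"question": "Who wrote it?", "answer": "Orwell", "confidence": 0.91, "uncertainty": 0.10},
    {"question": "When?",         "answer": "1984",   "confidence": 0.42, "uncertainty": 0.55},
]
print(select_history(history))  # the low-confidence second turn is dropped
```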
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- Language Models (Mostly) Know What They Know [10.836210010868932]
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer.
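A minimal sketch of what a P(IK)-style predictor could look like: a small binary head over a pooled question representation, trained against whether the base model's answer turned out to be correct. The encoder, dimensions, and training data below are placeholders, not the paper's setup.
```python
import torch
import torch.nn as nn

class PIKHead(nn.Module):
    """Predict P(IK): the probability the model knows the answer to a question,
    from a pooled question representation (no candidate answer is given)."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, question_repr):           # (batch, hidden_dim)
        return torch.sigmoid(self.classifier(question_repr)).squeeze(-1)

# Placeholder training step: labels are 1 when the base model answered the
# question correctly, 0 otherwise.
head = PIKHead()
reprs = torch.randn(4, 768)                     # stand-in question embeddings
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])
loss = nn.functional.binary_cross_entropy(head(reprs), labels)
loss.backward()
```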
arXiv Detail & Related papers (2022-07-11T22:59:39Z)
- Embarrassingly Simple Performance Prediction for Abductive Natural Language Inference [10.536415845097661]
We propose a method for predicting the performance of NLI models without fine-tuning them.
We show that the accuracy of the cosine similarity approach correlates strongly with the accuracy of the classification approach with a Pearson correlation coefficient of 0.65.
Our method can lead to significant time savings in the process of model selection.
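A minimal sketch of the zero-shot cosine-similarity proxy and the correlation check; the embeddings and accuracy numbers are made up, standing in for real encoders and abductive-NLI results.
```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical selection rule: pick the hypothesis whose sentence embedding is
# closer to the context's embedding (no fine-tuning involved).
def choose_hypothesis(context_emb, hyp1_emb, hyp2_emb):
    return 1 if cosine(context_emb, hyp1_emb) >= cosine(context_emb, hyp2_emb) else 2

# Illustrative numbers only: accuracy of the zero-shot cosine approach vs. the
# fine-tuned classification approach for a handful of candidate encoders.
cosine_acc    = np.array([0.58, 0.61, 0.64, 0.67, 0.70])
finetuned_acc = np.array([0.70, 0.72, 0.78, 0.77, 0.83])
r, _ = pearsonr(cosine_acc, finetuned_acc)
print(round(r, 2))  # a strong positive correlation supports using the cheap proxy
```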
arXiv Detail & Related papers (2022-02-21T18:10:24Z)
- Probabilistic Graph Reasoning for Natural Proof Generation [22.1374469158861]
We propose PRobr, a novel approach for joint answer prediction and proof generation.
PRobr defines a joint probabilistic distribution over all possible proof graphs and answers.
Experiments on multiple datasets verify the effectiveness of PRobr.
arXiv Detail & Related papers (2021-07-06T06:34:41Z)
- Learning to Perturb Word Embeddings for Out-of-distribution QA [55.103586220757464]
We propose a simple yet effective DA method based on a noise generator, which learns to perturb the word embedding of the input questions and context without changing their semantics.
We validate the performance of QA models trained with our word-embedding perturbations on a single source dataset across five different target domains.
Notably, the model trained with ours outperforms the model trained with more than 240K artificially generated QA pairs.
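A rough sketch of the general idea of a learned noise generator over word embeddings; the module, shapes, and scaling scheme below are assumptions, and the semantics-preserving training objective is not shown.
```python
import torch
import torch.nn as nn

class NoiseGenerator(nn.Module):
    """Learn per-dimension noise scales and add scaled Gaussian noise to word
    embeddings, as a rough stand-in for embedding-level data augmentation."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.scale = nn.Linear(embed_dim, embed_dim)

    def forward(self, embeddings):               # (batch, seq_len, embed_dim)
        sigma = torch.nn.functional.softplus(self.scale(embeddings))
        return embeddings + sigma * torch.randn_like(embeddings)

embeddings = torch.randn(2, 16, 768)              # stand-in question/context embeddings
perturbed = NoiseGenerator()(embeddings)          # same shape as the input
print(perturbed.shape)
```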
arXiv Detail & Related papers (2021-05-06T14:12:26Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
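One common post-hoc calibration method is temperature scaling fitted on held-out data; the paper studies several calibration techniques, so the sketch below (with synthetic logits and labels) is only an illustrative stand-in.
```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_at_temperature(t, logits, labels):
    """Negative log-likelihood of the correct answers after dividing logits by t."""
    scaled = logits / t
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Placeholder validation data: per-question logits over answer candidates and
# the index of the correct candidate.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 4)) * 5.0          # deliberately over-confident
labels = rng.integers(0, 4, size=100)

result = minimize_scalar(nll_at_temperature, bounds=(0.5, 10.0),
                         args=(logits, labels), method="bounded")
print(result.x)  # fitted temperature; values > 1 soften over-confident probabilities
```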
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
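A minimal sketch of combining a cross-entropy answer loss with an InfoNCE-style contrastive loss; the encoder outputs, augmentation pairing, and weighting below are placeholders rather than ConClaT's actual recipe.
```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, embeddings, aug_embeddings,
                  temperature=0.1, alpha=0.5):
    """Weighted sum of cross-entropy and a contrastive loss over representations.

    `embeddings` and `aug_embeddings` are representations of the same batch of
    inputs and their augmented views (a placeholder setup).
    """
    ce = F.cross_entropy(logits, targets)

    z1 = F.normalize(embeddings, dim=1)
    z2 = F.normalize(aug_embeddings, dim=1)
    sims = z1 @ z2.T / temperature                 # (batch, batch) similarity matrix
    positives = torch.arange(len(z1))              # matching augmented view is the positive
    contrastive = F.cross_entropy(sims, positives)

    return alpha * ce + (1.0 - alpha) * contrastive

logits = torch.randn(8, 3000)                      # VQA answer scores (placeholder vocab size)
targets = torch.randint(0, 3000, (8,))
emb, aug = torch.randn(8, 512), torch.randn(8, 512)
print(combined_loss(logits, targets, emb, aug))
```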
arXiv Detail & Related papers (2020-10-13T00:23:59Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious shortcut "capabilities" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.