Measuring and Improving Consistency in Pretrained Language Models
- URL: http://arxiv.org/abs/2102.01017v1
- Date: Mon, 1 Feb 2021 17:48:42 GMT
- Title: Measuring and Improving Consistency in Pretrained Language Models
- Authors: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander,
Eduard Hovy, Hinrich Schütze, Yoav Goldberg
- Abstract summary: We study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge?
Using ParaRel, we show that the consistency of all PLMs we experiment with is poor -- though with high variance between relations.
- Score: 40.46184998481918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consistency of a model -- that is, the invariance of its behavior under
meaning-preserving alternations in its input -- is a highly desirable property
in natural language processing. In this paper we study the question: Are
Pretrained Language Models (PLMs) consistent with respect to factual knowledge?
To this end, we create ParaRel, a high-quality resource of cloze-style query
English paraphrases. It contains a total of 328 paraphrases for 38
relations. Using ParaRel, we show that the consistency of all PLMs we
experiment with is poor -- though with high variance between relations. Our
analysis of the representational spaces of PLMs suggests that they have a poor
structure and are currently not suitable for representing knowledge in a robust
way. Finally, we propose a method for improving model consistency and
experimentally demonstrate its effectiveness.
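As a concrete illustration of the measurement setup described above, the sketch below probes a masked PLM with a few cloze-style paraphrases of a single fact and scores how often their top-1 predictions agree. This is a minimal sketch under stated assumptions: the model name (bert-base-cased), the hand-written paraphrases, and the pairwise-agreement score are illustrative choices, not the ParaRel data or the paper's exact evaluation protocol.

```python
# Minimal sketch: probe a masked LM with paraphrased cloze queries and check
# whether its top-1 predictions agree across paraphrases. The model choice,
# the hand-written paraphrases, and the pairwise score are illustrative only,
# not the ParaRel data or the paper's exact protocol.
from itertools import combinations
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")
mask = fill.tokenizer.mask_token  # "[MASK]" for BERT

# Cloze-style paraphrases of one relation instance (capital-of, subject "Paris").
paraphrases = [
    f"Paris is the capital of {mask}.",
    f"{mask}'s capital is Paris.",
    f"Paris is the capital city of {mask}.",
]

# Top-1 prediction for each paraphrase.
preds = [fill(p, top_k=1)[0]["token_str"].strip() for p in paraphrases]

# Consistency: fraction of paraphrase pairs with identical top-1 predictions.
pairs = list(combinations(preds, 2))
consistency = sum(a == b for a, b in pairs) / len(pairs)
print(preds, f"pairwise consistency = {consistency:.2f}")
```

A fully consistent model would return the same top-1 answer for every paraphrase, giving a score of 1.00; the paper's finding is that PLMs often fall well short of that.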
Related papers
- How often are errors in natural language reasoning due to paraphrastic variability? [29.079188032623605]
We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models.
We mathematically connect this metric to the proportion of a model's variance in correctness attributable to paraphrasing (a toy decomposition is sketched after this entry).
We collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems.
arXiv Detail & Related papers (2024-04-17T20:11:32Z)
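One way to read the variance claim in the ParaNLU entry above is as a law-of-total-variance decomposition over paraphrase groups. The sketch below is a toy illustration under that assumption; the 0/1 correctness table is invented, and this is not the paper's published metric code.

```python
# Toy decomposition: what share of the variance in correctness is within-problem,
# i.e. attributable to rewording the same problem, rather than to differences
# between problems. The correctness table is invented for illustration.
import numpy as np

correctness = {          # problem id -> 0/1 correctness across its paraphrases
    "q1": [1, 1, 0, 1],
    "q2": [0, 0, 0, 0],
    "q3": [1, 0, 1, 0],
}

outcomes = np.array(list(correctness.values()), dtype=float)
total_var = outcomes.flatten().var()        # Var(correct)
within_var = outcomes.var(axis=1).mean()    # E[Var(correct | problem)]

share = within_var / total_var if total_var > 0 else 0.0
print(f"share of correctness variance attributable to paraphrasing: {share:.1%}")
```

Under this reading, a perfectly paraphrase-consistent model would have zero within-problem variance, so the share would be 0%.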
- Improving Language Models Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
Non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading factor undermining their trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produces contradictory results.
We propose a practical approach that alleviates this inconsistent behaviour by improving the PLMs' meaning awareness.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric: a model is judged to be robust only if its performance is consistently accurate across each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform a much larger model on a recent large dataset.
We produce absolute state-of-the-art results for commonsense reasoning in binary choice tasks.
We argue that robustness of the smaller model ought to be understood in terms of compositionality.
arXiv Detail & Related papers (2022-01-23T22:00:54Z)
- Zero-shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [20.14487209460865]
We investigate four translation methods that can translate natural questions into cloze-style sentences (a toy rewrite is sketched after this entry).
We show that our methods are complementary to a knowledge-base-improved model, and that combining them can lead to state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-01-01T07:12:49Z)
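For the cloze-translation entry above, the snippet below shows one toy, rule-based rewrite of a natural question into a cloze-style sentence. It is only an illustrative sketch: the paper studies four translation methods, and this single syntactic pattern is an assumption rather than one of them.

```python
import re

def question_to_cloze(question: str, mask: str = "[MASK]") -> str:
    """Toy rewrite: turn a 'What is X?' question into the cloze 'X is [MASK].'"""
    m = re.match(r"(?i)\s*what is (.+)\?\s*$", question)
    if m:
        return f"{m.group(1)} is {mask}."
    # Fallback: just append an answer slot to the question.
    return f"{question.strip().rstrip('?')}: {mask}."

print(question_to_cloze("What is the capital of France?"))
# -> "the capital of France is [MASK]."
```

The resulting cloze sentence can then be queried with a masked LM, matching the zero-shot setup the entry describes.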
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that addresses the problem without auxiliary models or additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models [38.03490197822934]
Consistency refers to the capability of generating the same predictions for semantically similar contexts.
We propose a framework named consistency analysis on language understanding models (CALUM) to evaluate the model's lower-bound consistency ability.
arXiv Detail & Related papers (2021-08-15T06:25:07Z)