Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models
- URL: http://arxiv.org/abs/2108.06665v1
- Date: Sun, 15 Aug 2021 06:25:07 GMT
- Title: Accurate, yet inconsistent? Consistency Analysis on Language Understanding Models
- Authors: Myeongjun Jang, Deuk Sin Kwon, Thomas Lukasiewicz
- Abstract summary: Consistency refers to the capability of generating the same predictions for semantically similar contexts.
We propose a framework named consistency analysis on language understanding models (CALUM) to evaluate the model's lower-bound consistency ability.
- Score: 38.03490197822934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consistency, which refers to the capability of generating the same
predictions for semantically similar contexts, is a highly desirable property
for a sound language understanding model. Although recent pretrained language
models (PLMs) deliver outstanding performance in various downstream tasks, they
should also exhibit consistent behaviour if they truly understand language. In
this paper, we propose a simple framework named consistency analysis on
language understanding models (CALUM) to evaluate a model's
lower-bound consistency ability. Through experiments, we confirmed that current
PLMs are prone to generate inconsistent predictions even for semantically
identical inputs. We also observed that multi-task training with paraphrase
identification tasks helps to improve consistency, raising consistency by 13%
on average.
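As a concrete illustration of the lower-bound consistency check described above, the sketch below runs an off-the-shelf classifier over pairs of semantically identical inputs and reports the agreement rate. This is a minimal sketch, not the paper's implementation; it assumes the Hugging Face transformers library, and the model name and hand-written paraphrase pairs are illustrative choices.

```python
# Minimal sketch of a CALUM-style consistency check (illustrative only):
# count how often a classifier gives the same label to paraphrase pairs.
from transformers import pipeline

clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# hand-written pairs of semantically identical inputs
pairs = [
    ("The movie was not good.", "The movie wasn't good."),
    ("I really enjoyed this film.", "This is a film I really enjoyed."),
    ("Nobody laughed at the jokes.", "No one laughed at the jokes."),
]
agreements = [clf(a)[0]["label"] == clf(b)[0]["label"] for a, b in pairs]
print(f"consistency: {sum(agreements) / len(pairs):.2f}")
```

A perfectly consistent model would always agree with itself on such pairs; the paper's finding is that current PLMs often do not.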
Related papers
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models assign the same value to the joint probability of a word span when it is computed in different, mathematically equivalent ways.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders (a small illustrative sketch of such an order check follows this entry).
arXiv Detail & Related papers (2024-09-30T06:24:43Z)
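As a rough illustration of the order (in)consistency that ConTestS tests for, the sketch below scores a two-token span with a masked LM under both chain-rule factorisation orders. This is a minimal sketch of the idea, not the authors' code; it assumes the Hugging Face transformers library with bert-base-uncased, and the example sentence and token positions are arbitrary choices.

```python
# Minimal sketch: does an MLM assign the same joint probability to a
# two-token span under both chain-rule orders? (Illustrative only.)
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def token_logprob(ids, pos, target_id):
    """Log-probability of target_id at position pos, with pos masked."""
    masked = ids.clone()
    masked[0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    return torch.log_softmax(logits[0, pos], dim=-1)[target_id].item()

def joint_logprob_both_orders(text, pos_a, pos_b):
    """Joint log-prob of the tokens at pos_a and pos_b, factorised
    both as p(a)p(b|a) and as p(b)p(a|b)."""
    ids = tok(text, return_tensors="pt").input_ids
    a_id, b_id = ids[0, pos_a].item(), ids[0, pos_b].item()
    ids_b_masked = ids.clone(); ids_b_masked[0, pos_b] = tok.mask_token_id
    ids_a_masked = ids.clone(); ids_a_masked[0, pos_a] = tok.mask_token_id
    order_ab = token_logprob(ids_b_masked, pos_a, a_id) + token_logprob(ids, pos_b, b_id)
    order_ba = token_logprob(ids_a_masked, pos_b, b_id) + token_logprob(ids, pos_a, a_id)
    return order_ab, order_ba

# after BERT tokenisation, positions 4 and 6 hold "france" and "paris"
ab, ba = joint_logprob_both_orders("The capital of France is Paris.", 4, 6)
print(f"log p(a)p(b|a) = {ab:.3f}, log p(b)p(a|b) = {ba:.3f}, gap = {abs(ab - ba):.3f}")
```

An order-consistent model would make the gap zero; ConTestS applies statistical tests over many such spans rather than inspecting single examples.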
- Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models [6.394084132117747]
We propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models.
Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
arXiv Detail & Related papers (2024-08-21T00:17:59Z)
- Improving Language Models' Meaning Understanding and Consistency by Learning Conceptual Roles from Dictionary [65.268245109828]
The non-human-like behaviour of contemporary pre-trained language models (PLMs) is a leading cause of their limited trustworthiness.
A striking phenomenon is the generation of inconsistent predictions, which produce contradictory results.
We propose a practical approach that alleviates this inconsistent behaviour by improving PLMs' awareness of word meaning, learned from dictionary definitions.
arXiv Detail & Related papers (2023-10-24T06:15:15Z)
- Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
- On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of large language models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence generation task, using a generative language model with unidirectional attention (a small format sketch follows this entry).
Our approach outperforms the previous state of the art (based on BERT) in average performance by a large margin in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
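To make the sequence-generation reformulation above concrete, here is a small sketch of how ABSA annotations could be linearised into a single target string for a generative LM. The "term | category | polarity" separators are invented for illustration; the paper defines its own templates.

```python
# Hypothetical linearisation of ABSA annotations into one target sequence;
# the separators are illustrative, not the paper's actual templates.
def to_generation_example(sentence, annotations):
    target = " ; ".join(
        f"{a['term']} | {a['category']} | {a['polarity']}" for a in annotations
    )
    return sentence, target

src, tgt = to_generation_example(
    "The pasta was great but the service was slow.",
    [{"term": "pasta", "category": "food", "polarity": "positive"},
     {"term": "service", "category": "service", "polarity": "negative"}],
)
print(src, "->", tgt)
```

A generative LM fine-tuned on such (source, target) pairs performs extraction and polarity prediction in one decoding pass.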
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models learn to represent the syntactic structures prevalent in classical NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Measuring and Improving Consistency in Pretrained Language Models [40.46184998481918]
We study the question: are pretrained language models (PLMs) consistent with respect to factual knowledge?
Using ParaRel, a resource of paraphrased cloze-style relation patterns, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations (a small illustrative probe follows this entry).
arXiv Detail & Related papers (2021-02-01T17:48:42Z)
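In the spirit of the factual-consistency probe above, the sketch below queries a masked LM with paraphrased cloze templates for a single relation and checks whether the top predictions agree. A minimal sketch under assumptions: the Hugging Face transformers library, bert-base-cased, and hand-written templates rather than ParaRel's curated patterns.

```python
# Minimal sketch of a paraphrase-consistency probe (illustrative templates,
# not ParaRel's curated relation patterns).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

templates = [
    "The capital of France is [MASK].",
    "France's capital city is [MASK].",
    "The seat of government of France is [MASK].",
]
preds = [fill(t, top_k=1)[0]["token_str"] for t in templates]
print(preds, "consistent:", len(set(preds)) == 1)
```

A consistent model should produce the same filler for every paraphrase of the same factual query, regardless of whether that filler is correct.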
- Wisdom of the Ensemble: Improving Consistency of Deep Learning Models [11.230300336108018]
Trust is often a function of constant behaviour.
This paper studies model behaviour in the context of periodic retraining of deployed models.
We prove that the consistency and correct-consistency of an ensemble learner are not less than the average consistency and correct-consistency of its individual learners (a small numerical illustration follows this entry).
arXiv Detail & Related papers (2020-11-13T07:47:01Z)
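The consistency and correct-consistency notions above can be checked numerically. The simulation below is a minimal sketch (not the paper's code): independent learners are "retrained" twice, and the average per-learner metrics are compared against a majority-vote ensemble.

```python
# Minimal simulation of consistency vs. correct-consistency across two
# retraining runs, for individual learners and a majority-vote ensemble.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_points = 5, 10_000
y = rng.integers(0, 2, n_points)  # ground-truth binary labels

def simulate_predictions(y, accuracy, rng):
    """Predictions that match y independently with probability `accuracy`."""
    flip = rng.random(y.shape) >= accuracy
    return np.where(flip, 1 - y, y)

# each learner is retrained twice (e.g. periodic redeployment)
run1 = np.stack([simulate_predictions(y, 0.8, rng) for _ in range(n_models)])
run2 = np.stack([simulate_predictions(y, 0.8, rng) for _ in range(n_models)])

def consistency(p, q):             # same prediction in both runs
    return np.mean(p == q)

def correct_consistency(p, q, y):  # same prediction in both runs *and* correct
    return np.mean((p == q) & (p == y))

avg_c = np.mean([consistency(run1[i], run2[i]) for i in range(n_models)])
avg_cc = np.mean([correct_consistency(run1[i], run2[i], y) for i in range(n_models)])
ens1 = (run1.sum(axis=0) > n_models // 2).astype(int)  # majority vote
ens2 = (run2.sum(axis=0) > n_models // 2).astype(int)

print(f"avg individual consistency: {avg_c:.3f} vs ensemble: {consistency(ens1, ens2):.3f}")
print(f"avg individual correct-consistency: {avg_cc:.3f} vs ensemble: "
      f"{correct_consistency(ens1, ens2, y):.3f}")
```

In this simulation the ensemble's consistency and correct-consistency exceed the individual averages, matching the direction of the paper's bound.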