Do Fine-tuned Commonsense Language Models Really Generalize?
- URL: http://arxiv.org/abs/2011.09159v1
- Date: Wed, 18 Nov 2020 08:52:49 GMT
- Title: Do Fine-tuned Commonsense Language Models Really Generalize?
- Authors: Mayank Kejriwal and Ke Shen
- Abstract summary: We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
- Score: 8.591839265985412
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, transformer-based methods such as RoBERTa and GPT-3 have led to
significant experimental advances in natural language processing tasks such as
question answering and commonsense reasoning. The latter is typically evaluated
through multiple benchmarks framed as multiple-choice instances of the former.
According to influential leaderboards hosted by the Allen Institute (evaluating
state-of-the-art performance on commonsense reasoning benchmarks), models based
on such transformer methods are approaching human-like performance and have
average accuracy well over 80% on many benchmarks. Since these are commonsense
benchmarks, a model that generalizes on commonsense reasoning should not
experience much performance loss across multiple commonsense benchmarks. In
this paper, we study the generalization issue in detail by designing and
conducting a rigorous scientific study. Using five common benchmarks, multiple
controls and statistical analysis, we find clear evidence that fine-tuned
commonsense language models still do not generalize well, even with moderate
changes to the experimental setup, and may, in fact, be susceptible to dataset
bias. We also perform selective studies, including qualitative and consistency
analyses, to gain deeper insight into the problem.
Related papers
- Can we hop in general? A discussion of benchmark selection and design using the Hopper environment [12.18012293738896]
We argue that benchmarking in reinforcement learning needs to be treated as a scientific discipline itself.
A case study shows that the selection of standard benchmarking suites can drastically change how we judge the performance of algorithms.
arXiv Detail & Related papers (2024-10-11T14:47:22Z)
- Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models [7.779982757267302]
We investigate the generality of analogy-making abilities previously claimed for large language models (LLMs).
We show that while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set.
arXiv Detail & Related papers (2024-02-14T05:52:23Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of
Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Benchmarks for Automated Commonsense Reasoning: A Survey [0.0]
More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of AI systems.
This paper surveys the development and uses of AI commonsense benchmarks.
arXiv Detail & Related papers (2023-02-09T16:34:30Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
- What do Toothbrushes do in the Kitchen? How Transformers Think our World is Structured [137.83584233680116]
We investigate to what extent transformer-based language models allow for extracting knowledge about object relations.
We show that the models combined with the different similarity measures differ greatly in terms of the amount of knowledge they allow for extracting.
Surprisingly, static models perform almost as well as contextualized models -- in some cases even better.
arXiv Detail & Related papers (2022-04-12T10:00:20Z)
- A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
arXiv Detail & Related papers (2022-03-23T04:06:01Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- Improving QA Generalization by Concurrent Modeling of Multiple Biases [61.597362592536896]
Existing NLP datasets contain various biases that models can easily exploit to achieve high performance on the corresponding evaluation sets.
We propose a general framework for improving the performance on both in-domain and out-of-domain datasets by concurrent modeling of multiple biases in the training data.
We extensively evaluate our framework on extractive question answering with training data from various domains with multiple biases of different strengths.
arXiv Detail & Related papers (2020-10-07T11:18:49Z)