Uncontrolled Lexical Exposure Leads to Overestimation of Compositional
Generalization in Pretrained Models
- URL: http://arxiv.org/abs/2212.10769v1
- Date: Wed, 21 Dec 2022 05:02:08 GMT
- Title: Uncontrolled Lexical Exposure Leads to Overestimation of Compositional
Generalization in Pretrained Models
- Authors: Najoung Kim, Tal Linzen, Paul Smolensky
- Abstract summary: We argue that exposure to pretraining data may break distributional control.
We find that two modified evaluation setups which control for this issue both lead to lower generalization performance in T5.
- Score: 31.573015421633155
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Human linguistic capacity is often characterized by compositionality and the
generalization it enables -- human learners can produce and comprehend novel
complex expressions by composing known parts. Several benchmarks exploit
distributional control across training and test to gauge compositional
generalization, where certain lexical items only occur in limited contexts
during training. While recent work using these benchmarks suggests that
pretrained models achieve impressive generalization performance, we argue that
exposure to pretraining data may break the aforementioned distributional
control. Using the COGS benchmark of Kim and Linzen (2020), we test two
modified evaluation setups that control for this issue: (1) substituting
context-controlled lexical items with novel character sequences, and (2)
substituting them with special tokens represented by novel embeddings. We find
that both of these setups lead to lower generalization performance in T5
(Raffel et al., 2020), suggesting that previously reported results have been
overestimated due to uncontrolled lexical exposure during pretraining. The
performance degradation is more extreme with novel embeddings, and the
degradation increases with the amount of pretraining data, highlighting an
interesting case of inverse scaling.
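The two controlled setups can be illustrated with a minimal sketch, assuming the Hugging Face transformers implementation of T5; the nonce word, special-token name, and example sentence below are hypothetical placeholders rather than the items used in the paper's COGS experiments.
```python
# Minimal sketch of the two lexical-substitution setups, assuming the
# Hugging Face `transformers` T5 implementation. The replacement strings
# and the special-token name are illustrative placeholders, not the exact
# items used in the paper's COGS experiments.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

example = "A hedgehog needed the crayon ."  # a COGS-style input sentence

# Setup 1: replace a context-controlled lexical item with a novel
# character sequence that is unlikely to appear in the pretraining data.
novel_form = "blicket"  # hypothetical nonce word
setup1_input = example.replace("hedgehog", novel_form)

# Setup 2: replace it with a special token backed by a freshly
# initialized ("novel") embedding the pretrained model has never seen.
special_token = "<nonce_0>"  # hypothetical special-token name
tokenizer.add_tokens([special_token])
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
setup2_input = example.replace("hedgehog", special_token)

for text in (setup1_input, setup2_input):
    batch = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**batch, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
In the paper's evaluation, such substitutions are applied to the COGS data itself before fine-tuning, so that the context-controlled lexical items remain genuinely novel to the pretrained model.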
Related papers
- On the Generalization Ability of Unsupervised Pretraining [53.06175754026037]
Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization.
This paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase.
Our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm, and can shed light on the design of more effective pre-training algorithms.
arXiv Detail & Related papers (2024-03-11T16:23:42Z)
- Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries [131.80860903537172]
Ideal summarization models should generalize to novel summary-worthy content without remembering reference training summaries by rote.
We propose a fine-grained evaluation protocol by partitioning a test set based on the lexical similarity of reference test summaries with training summaries.
arXiv Detail & Related papers (2023-11-15T23:47:53Z)
- Enhancing Supervised Learning with Contrastive Markings in Neural Machine Translation Training [10.498938255717066]
Supervised learning in Neural Machine Translation (NMT) typically follows a teacher forcing paradigm.
We present a simple extension of standard maximum likelihood estimation by a contrastive marking objective.
We show that training with contrastive markings yields improvements on top of supervised learning.
arXiv Detail & Related papers (2023-07-17T11:56:32Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Dynamic Scheduled Sampling with Imitation Loss for Neural Text Generation [10.306522595622651]
We introduce Dynamic Scheduled Sampling with Imitation Loss (DySI), which maintains the schedule based solely on the training time accuracy.
DySI achieves notable improvements on standard machine translation benchmarks, and significantly improves the robustness of other text generation models.
arXiv Detail & Related papers (2023-01-31T16:41:06Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Categorizing Semantic Representations for Neural Machine Translation [53.88794787958174]
We introduce categorization to the source contextualized representations.
The main idea is to enhance generalization by reducing sparsity and overfitting.
Experiments on a dedicated MT dataset show that our method reduces compositional generalization error rates by 24%.
arXiv Detail & Related papers (2022-10-13T04:07:08Z)
- Revisiting the Compositional Generalization Abilities of Neural Sequence Models [23.665350744415004]
We focus on one-shot primitive generalization as introduced by the popular SCAN benchmark.
We demonstrate that modifying the training distribution in simple and intuitive ways enables standard seq-to-seq models to achieve near-perfect generalization performance.
arXiv Detail & Related papers (2022-03-14T18:03:21Z)
- Compositional generalization in semantic parsing with pretrained transformers [13.198689566654108]
We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization.
We also show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence.
arXiv Detail & Related papers (2021-09-30T13:06:29Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)