Compositional generalization in semantic parsing with pretrained transformers
- URL: http://arxiv.org/abs/2109.15101v1
- Date: Thu, 30 Sep 2021 13:06:29 GMT
- Title: Compositional generalization in semantic parsing with pretrained transformers
- Authors: A. Emin Orhan
- Abstract summary: We show that language models pretrained exclusively with non-English corpora, or even with programming language corpora, significantly improve out-of-distribution generalization.
We also show that larger models are harder to train from scratch and their generalization accuracy is lower when trained up to convergence.
- Score: 13.198689566654108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale pretraining instills large amounts of knowledge in deep neural
networks. This, in turn, improves the generalization behavior of these models
in downstream tasks. What exactly are the limits to the generalization benefits
of large-scale pretraining? Here, we report observations from some simple
experiments aimed at addressing this question in the context of two semantic
parsing tasks involving natural language, SCAN and COGS. We show that language
models pretrained exclusively with non-English corpora, or even with
programming language corpora, significantly improve out-of-distribution
generalization in these benchmarks, compared with models trained from scratch,
even though both benchmarks are English-based. This demonstrates the
surprisingly broad transferability of pretrained representations and knowledge.
Pretraining with a large-scale protein sequence prediction task, on the other
hand, mostly deteriorates the generalization performance in SCAN and COGS,
suggesting that pretrained representations do not transfer universally and that
there are constraints on the similarity between the pretraining and downstream
domains for successful transfer. Finally, we show that larger models are harder
to train from scratch and their generalization accuracy is lower when trained
up to convergence on the relatively small SCAN and COGS datasets, but the
benefits of large-scale pretraining become much clearer with larger models.
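As a rough, hypothetical sketch of the kind of experiment described above (fine-tuning a pretrained sequence-to-sequence transformer on a SCAN-style semantic parsing task), the following Python snippet uses the Hugging Face Transformers library. The model name (t5-small), the hyperparameters, and the toy command/action pairs are illustrative assumptions, not the authors' exact configuration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Assumption: any off-the-shelf pretrained encoder-decoder; the paper also
    # probes checkpoints pretrained on non-English text and on source code.
    model_name = "t5-small"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Toy SCAN-style (command -> action sequence) pairs for illustration only;
    # the real benchmark supplies train/test splits designed so that the test
    # set requires compositional (out-of-distribution) generalization.
    pairs = [
        ("walk twice", "I_WALK I_WALK"),
        ("jump left", "I_TURN_LEFT I_JUMP"),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for command, actions in pairs:
        batch = tokenizer(command, return_tensors="pt")
        labels = tokenizer(actions, return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss  # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Generalization is then scored by exact-match accuracy of generated
    # action sequences on the held-out split.
    model.eval()
    with torch.no_grad():
        test = tokenizer("jump twice", return_tensors="pt")
        out = model.generate(**test, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

In the paper's setting, the same comparison is run with the model trained from scratch (randomly initialized weights) versus initialized from various pretrained checkpoints, and accuracy is evaluated on the out-of-distribution generalization splits of SCAN and COGS.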
Related papers
- Bayes' Power for Explaining In-Context Learning Generalizations [46.17844703369127]
In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior.
We show how models become robust in-context learners by effectively composing knowledge from their training data.
arXiv Detail & Related papers (2024-10-02T14:01:34Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Improving BERT Pretraining with Syntactic Supervision [2.4087148947930634]
Bidirectional masked Transformers have become the core theme in the current NLP landscape.
We apply our methodology to Lassy Large, an automatically annotated corpus of written Dutch.
Our experiments suggest that our syntax-aware model performs on par with established baselines.
arXiv Detail & Related papers (2021-04-21T13:15:58Z)
- The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [115.49214555402567]
Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation.
Recent studies suggest that pre-training benefits from gigantic model capacity.
In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH).
arXiv Detail & Related papers (2020-12-12T21:53:55Z)
- Text Classification with Few Examples using Controlled Generalization [58.971750512415134]
Current practice relies on pre-trained word embeddings to map words unseen in training to similar seen ones.
Our alternative begins with sparse pre-trained representations derived from unlabeled parsed corpora.
We show that a feed-forward network over these vectors is especially effective in low-data scenarios.
arXiv Detail & Related papers (2020-05-18T06:04:58Z)
- Adversarial Training for Large Neural Language Models [107.84290922621163]
We show that adversarial pre-training can improve both generalization and robustness.
ALUM regularizes the training objective by applying perturbations in the embedding space that maximize the adversarial loss.
ALUM can be further combined with task-specific fine-tuning to attain additional gains.
arXiv Detail & Related papers (2020-04-20T00:07:18Z)