Data Factors for Better Compositional Generalization
- URL: http://arxiv.org/abs/2311.04420v1
- Date: Wed, 8 Nov 2023 01:27:34 GMT
- Title: Data Factors for Better Compositional Generalization
- Authors: Xiang Zhou, Yichen Jiang, Mohit Bansal
- Abstract summary: We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
- Score: 60.698130703909804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent diagnostic datasets on compositional generalization, such as SCAN
(Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems
in models trained from scratch on these datasets. However, in contrast to this
poor performance, state-of-the-art models trained on larger and more general
datasets show better generalization ability. In this work, to reconcile this
inconsistency, we conduct an empirical analysis by training Transformer models
on a variety of training sets with different data factors, including dataset
scale, pattern complexity, example difficulty, etc. First, we show that
increased dataset complexity can lead to better generalization behavior on
multiple different generalization challenges. To further understand this
improvement, we identify two axes along which more complex datasets help: they
provide more diverse examples, so compositional understanding becomes more
effective, and they reduce example repetition frequency, which discourages
ungeneralizable memorization of individual examples. Finally, we explore how
training examples of different difficulty levels influence generalization
differently. On synthetic datasets, simple examples invoke stronger
compositionality than hard examples do. On larger-scale real language datasets,
hard examples become more important, potentially because they ensure adequate
data coverage, but a balanced mixture of simple and hard examples induces the
strongest generalization. The code and data for this work are available at
https://github.com/owenzx/data4comp
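As a concrete illustration of the data-factor analysis described above, the following is a minimal sketch (not taken from the released data4comp code) of composing a training set with a controlled ratio of simple and hard examples while capping per-example repetition frequency. The length-based difficulty proxy, the mixture ratio, and the repetition cap are all illustrative assumptions.
```python
import random
from collections import Counter

def difficulty(example):
    """Illustrative difficulty proxy: length of the target sequence.
    The paper considers several notions of difficulty; this is only a stand-in."""
    return len(example["target"].split())

def build_mixture(examples, simple_ratio=0.5, budget=10_000, max_repeats=2, seed=0):
    """Sample a training set with a fixed simple/hard ratio and a cap on how
    often any single example may repeat (limiting rote memorization)."""
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)
    split = len(ranked) // 2
    pools = {"simple": ranked[:split], "hard": ranked[split:]}

    counts, mixture, attempts = Counter(), [], 0
    while len(mixture) < budget and attempts < budget * 20:
        attempts += 1
        pool = pools["simple"] if rng.random() < simple_ratio else pools["hard"]
        ex = rng.choice(pool)
        key = (ex["source"], ex["target"])
        if counts[key] >= max_repeats:  # cap repetition frequency
            continue
        counts[key] += 1
        mixture.append(ex)
    return mixture

# Toy usage with SCAN-style command/action pairs (purely synthetic placeholders).
data = [{"source": f"walk left {i}", "target": " ".join(["WALK"] * (i % 6 + 1))}
        for i in range(400)]
train_set = build_mixture(data, simple_ratio=0.5, budget=500)
```
In the paper's terms, simple_ratio sweeps the simple/hard mixture and max_repeats controls example repetition frequency; both are knobs one would vary in the kind of empirical analysis the abstract describes.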
Related papers
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of 6 benchmark datasets spanning various domains.
Each problem within these datasets is annotated with numerical difficulty scores.
We provide a comprehensive analysis of LLMs' performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z)
- Towards Understanding the Relationship between In-context Learning and Compositional Generalization [7.843029855730508]
We train a causal Transformer in a setting that renders ordinary learning very difficult.
However, the model can solve the task by utilizing earlier examples to generalize to later ones.
In evaluations on the SCAN, COGS, and GeoQuery datasets, models trained in this manner indeed show improved compositional generalization.
arXiv Detail & Related papers (2024-03-18T14:45:52Z)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple training methods such as in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
- Simplicity Bias Leads to Amplified Performance Disparities [8.60453031364566]
We show that SGD-trained models have a bias towards simplicity, leading them to prioritize learning a majority class.
A model may prioritize any class or group of the dataset that it finds simple, at the expense of what it finds complex.
arXiv Detail & Related papers (2022-12-13T15:24:41Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- On the Pitfalls of Learning with Limited Data: A Facial Expression Recognition Case Study [0.5249805590164901]
We focus on the problem of Facial Expression Recognition from videos.
We performed an extensive study with four databases of different complexity and nine deep-learning architectures for video classification.
We found that complex training sets translate better to more stable test sets when models are trained with transfer learning and synthetically generated data.
arXiv Detail & Related papers (2021-04-02T18:53:41Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets; a minimal sketch of an auxiliary objective over such pairs appears after this list.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- A Close Look at Deep Learning with Small Data [0.0]
We show that model complexity is a critical factor when only a few samples per class are available.
We also show that even standard data augmentation can boost recognition performance by large margins.
arXiv Detail & Related papers (2020-03-28T17:11:29Z)
- Robust and On-the-fly Dataset Denoising for Image Classification [72.10311040730815]
On-the-fly Data Denoising (ODD) is robust to mislabeled examples, while introducing almost zero computational overhead compared to standard training.
ODD is able to achieve state-of-the-art results on a wide range of datasets including real-world ones such as WebVision and Clothing1M.
arXiv Detail & Related papers (2020-03-24T03:59:26Z)
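As referenced in the counterfactual-examples entry above, here is a minimal, hypothetical PyTorch sketch of an auxiliary objective over pairs of minimally-different examples: the input-gradient of the task loss is encouraged to point from an example toward its counterfactual. This is a simplified gradient-supervision-style objective, not the authors' released implementation; the model interface, loss weight, and continuous-input assumption are all assumptions made for illustration.
```python
import torch
import torch.nn.functional as F

def counterfactual_pair_loss(model, x, y, x_cf, y_cf, aux_weight=0.1):
    """Task loss on both halves of a counterfactual pair plus an auxiliary term
    that aligns the input-gradient of the loss at x with the direction toward
    its counterfactual x_cf (a simplified gradient-supervision-style objective)."""
    x = x.clone().requires_grad_(True)

    logits = model(x)
    loss_x = F.cross_entropy(logits, y)
    task_loss = loss_x + F.cross_entropy(model(x_cf), y_cf)

    # Direction in input space from the example toward its counterfactual.
    delta = (x_cf - x).detach()

    # Gradient of the task loss for x with respect to the input itself.
    grad = torch.autograd.grad(loss_x, x, create_graph=True)[0]

    # Moving toward the counterfactual should increase the loss under label y,
    # so encourage the gradient to align with that direction (per-example cosine).
    cos = F.cosine_similarity(grad.flatten(1), delta.flatten(1), dim=1)
    aux_loss = (1.0 - cos).mean()

    return task_loss + aux_weight * aux_loss
```
Here x and x_cf are assumed to be continuous feature tensors of identical shape (e.g. image or fused multimodal features); for token inputs one would apply the same idea to embeddings. The cosine formulation and the 0.1 weight are placeholders to be tuned.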