Data Factors for Better Compositional Generalization
- URL: http://arxiv.org/abs/2311.04420v1
- Date: Wed, 8 Nov 2023 01:27:34 GMT
- Title: Data Factors for Better Compositional Generalization
- Authors: Xiang Zhou, Yichen Jiang, Mohit Bansal
- Abstract summary: We conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors.
We show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges.
We explore how training examples of different difficulty levels influence generalization differently.
- Score: 60.698130703909804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent diagnostic datasets on compositional generalization, such as SCAN
(Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems
in models trained from scratch on these datasets. However, in contrast to this
poor performance, state-of-the-art models trained on larger and more general
datasets show better generalization ability. In this work, to reconcile this
inconsistency, we conduct an empirical analysis by training Transformer models
on a variety of training sets with different data factors, including dataset
scale, pattern complexity, example difficulty, etc. First, we show that
increased dataset complexity can lead to better generalization behavior on
multiple different generalization challenges. To further understand this
improvement, we identify two axes along which more complex datasets help: they
provide more diverse examples, so compositional understanding becomes more
effective, and they reduce example repetition frequency, which discourages
ungeneralizable memorization of individual examples. Finally, we explore how
training examples of different difficulty levels influence generalization
differently. On synthetic datasets, simple examples invoke stronger
compositionality than hard examples do. On larger-scale real language datasets,
hard examples become more important, potentially because they ensure adequate
data coverage, but a balanced mixture of simple and hard examples induces the
strongest generalization. The code and data for this work are available at
https://github.com/owenzx/data4comp
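As a concrete illustration of the data-factor analysis described above, the following is a minimal sketch (not taken from the released data4comp code) of composing a training set with a controlled ratio of simple and hard examples while capping per-example repetition frequency. The length-based difficulty proxy, the mixture ratio, and the repetition cap are all illustrative assumptions.
```python
import random
from collections import Counter

def difficulty(example):
    """Illustrative difficulty proxy: length of the target sequence.
    The paper considers several notions of difficulty; this is only a stand-in."""
    return len(example["target"].split())

def build_mixture(examples, simple_ratio=0.5, budget=10_000, max_repeats=2, seed=0):
    """Sample a training set with a fixed simple/hard ratio and a cap on how
    often any single example may repeat (limiting rote memorization)."""
    rng = random.Random(seed)
    ranked = sorted(examples, key=difficulty)
    split = len(ranked) // 2
    pools = {"simple": ranked[:split], "hard": ranked[split:]}

    counts, mixture, attempts = Counter(), [], 0
    while len(mixture) < budget and attempts < budget * 20:
        attempts += 1
        pool = pools["simple"] if rng.random() < simple_ratio else pools["hard"]
        ex = rng.choice(pool)
        key = (ex["source"], ex["target"])
        if counts[key] >= max_repeats:  # cap repetition frequency
            continue
        counts[key] += 1
        mixture.append(ex)
    return mixture

# Toy usage with SCAN-style command/action pairs (purely synthetic placeholders).
data = [{"source": f"walk left {i}", "target": " ".join(["WALK"] * (i % 6 + 1))}
        for i in range(400)]
train_set = build_mixture(data, simple_ratio=0.5, budget=500)
```
In the paper's terms, simple_ratio sweeps the simple/hard mixture and max_repeats controls example repetition frequency; both are knobs one would vary in the kind of empirical analysis the abstract describes.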
Related papers
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of 6 benchmark datasets spanning various domains.
Each problem within these datasets is annotated with numerical difficulty scores.
We provide a comprehensive analysis of LLMs' performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z)
- Towards Understanding the Relationship between In-context Learning and Compositional Generalization [7.843029855730508]
We train a causal Transformer in a setting that renders ordinary learning very difficult.
However, the model can solve the task by utilizing earlier examples to generalize to later ones.
In evaluations on the SCAN, COGS, and GeoQuery datasets, models trained in this manner indeed show improved compositional generalization.
arXiv Detail & Related papers (2024-03-18T14:45:52Z)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple training methods such as in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
- Simplicity Bias Leads to Amplified Performance Disparities [8.60453031364566]
We show that SGD-trained models have a bias towards simplicity, leading them to prioritize learning a majority class.
A model may prioritize any class or group of the dataset that it finds simple, at the expense of what it finds complex.
arXiv Detail & Related papers (2022-12-13T15:24:41Z)
- Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization [93.8373619657239]
Neural networks trained with SGD were recently shown to rely preferentially on linearly-predictive features.
This simplicity bias can explain their lack of robustness out of distribution (OOD).
We demonstrate that the simplicity bias can be mitigated and OOD generalization improved.
arXiv Detail & Related papers (2021-05-12T12:12:24Z)
- On the Pitfalls of Learning with Limited Data: A Facial Expression Recognition Case Study [0.5249805590164901]
We focus on the problem of Facial Expression Recognition from videos.
We performed an extensive study with four databases of different complexity and nine deep-learning architectures for video classification.
We found that complex training sets translate better to more stable test sets when models are trained with transfer learning and synthetically generated data.
arXiv Detail & Related papers (2021-04-02T18:53:41Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task.
Models trained with this technique demonstrate improved performance on out-of-distribution test sets; a minimal sketch of an auxiliary objective over such pairs appears after this list.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
- A Close Look at Deep Learning with Small Data [0.0]
We show that model complexity is a critical factor when only a few samples per class are available.
We also show that even standard data augmentation can boost recognition performance by large margins.
arXiv Detail & Related papers (2020-03-28T17:11:29Z)
- Robust and On-the-fly Dataset Denoising for Image Classification [72.10311040730815]
On-the-fly Data Denoising (ODD) is robust to mislabeled examples, while introducing almost zero computational overhead compared to standard training.
ODD is able to achieve state-of-the-art results on a wide range of datasets including real-world ones such as WebVision and Clothing1M.
arXiv Detail & Related papers (2020-03-24T03:59:26Z)
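As referenced in the counterfactual-examples entry above, here is a minimal, hypothetical PyTorch sketch of an auxiliary objective over pairs of minimally-different examples: the input-gradient of the task loss is encouraged to point from an example toward its counterfactual. This is a simplified gradient-supervision-style objective, not the authors' released implementation; the model interface, loss weight, and continuous-input assumption are all assumptions made for illustration.
```python
import torch
import torch.nn.functional as F

def counterfactual_pair_loss(model, x, y, x_cf, y_cf, aux_weight=0.1):
    """Task loss on both halves of a counterfactual pair plus an auxiliary term
    that aligns the input-gradient of the loss at x with the direction toward
    its counterfactual x_cf (a simplified gradient-supervision-style objective)."""
    x = x.clone().requires_grad_(True)

    logits = model(x)
    loss_x = F.cross_entropy(logits, y)
    task_loss = loss_x + F.cross_entropy(model(x_cf), y_cf)

    # Direction in input space from the example toward its counterfactual.
    delta = (x_cf - x).detach()

    # Gradient of the task loss for x with respect to the input itself.
    grad = torch.autograd.grad(loss_x, x, create_graph=True)[0]

    # Moving toward the counterfactual should increase the loss under label y,
    # so encourage the gradient to align with that direction (per-example cosine).
    cos = F.cosine_similarity(grad.flatten(1), delta.flatten(1), dim=1)
    aux_loss = (1.0 - cos).mean()

    return task_loss + aux_weight * aux_loss
```
Here x and x_cf are assumed to be continuous feature tensors of identical shape (e.g. image or fused multimodal features); for token inputs one would apply the same idea to embeddings. The cosine formulation and the 0.1 weight are placeholders to be tuned.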