Finding needles in a haystack: Sampling Structurally-diverse Training
Sets from Synthetic Data for Compositional Generalization
- URL: http://arxiv.org/abs/2109.02575v1
- Date: Mon, 6 Sep 2021 16:20:47 GMT
- Title: Finding needles in a haystack: Sampling Structurally-diverse Training
Sets from Synthetic Data for Compositional Generalization
- Authors: Inbar Oren, Jonathan Herzig and Jonathan Berant
- Abstract summary: We investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing.
We select a subset of synthetic examples that are structurally-diverse and use them to improve compositional generalization.
- We evaluate our approach on a new split of the schema2QA dataset, and show that it leads to dramatic improvements in compositional generalization as well as moderate improvements in the traditional i.i.d. setup.
- Score: 33.30539396439008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern semantic parsers suffer from two principal limitations. First,
training requires expensive collection of utterance-program pairs. Second,
semantic parsers fail to generalize at test time to new compositions/structures
that have not been observed during training. Recent research has shown that
automatic generation of synthetic utterance-program pairs can alleviate the
first problem, but its potential for the second has thus far been
under-explored. In this work, we investigate automatic generation of synthetic
utterance-program pairs for improving compositional generalization in semantic
parsing. Given a small training set of annotated examples and an "infinite"
pool of synthetic examples, we select a subset of synthetic examples that are
structurally-diverse and use them to improve compositional generalization. We
evaluate our approach on a new split of the schema2QA dataset, and show that it
leads to dramatic improvements in compositional generalization as well as
moderate improvements in the traditional i.i.d. setup. Moreover,
structurally-diverse sampling achieves these improvements with as few as 5K
examples, compared to 1M examples when sampling uniformly at random -- a 200x
improvement in data efficiency.
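To make the sampling idea above concrete, here is a minimal, hypothetical Python sketch under two stated assumptions: synthetic programs are plain strings, and anonymizing literal values yields a usable structural "template". The template() regex heuristic, the greedy coverage loop, and the toy examples below are illustrative choices, not the authors' actual algorithm or synthetic grammar.

    # Illustrative sketch only: not the paper's exact sampling procedure.
    import random
    import re

    def template(program: str) -> str:
        # Assumed heuristic: strip literal values so only structure remains.
        return re.sub(r'"[^"]*"|\b\d+\b', "<VAL>", program)

    def diverse_sample(pool, budget, seed=0):
        """Greedily keep one example per unseen structural template,
        then top up with random leftovers if templates run out."""
        pool = list(pool)
        random.Random(seed).shuffle(pool)
        selected, seen = [], set()
        for utterance, program in pool:
            if len(selected) == budget:
                break
            t = template(program)
            if t not in seen:
                seen.add(t)
                selected.append((utterance, program))
        leftovers = [ex for ex in pool if ex not in selected]
        selected.extend(leftovers[: budget - len(selected)])
        return selected

    if __name__ == "__main__":
        synthetic_pool = [
            ("find restaurants rated above 4", "filter(restaurant, rating > 4)"),
            ("find restaurants rated above 3", "filter(restaurant, rating > 3)"),
            ("how many restaurants are open now",
             'count(filter(restaurant, open = "now"))'),
        ]
        # With budget=2, the two rating queries share a template, so the
        # count query is guaranteed a slot.
        print(diverse_sample(synthetic_pool, budget=2))

A uniform random sample of the same size could easily draw both rating queries and miss the count structure entirely, which is the failure mode the structurally-diverse strategy is meant to avoid.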
Related papers
- ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Semantic Self-adaptation: Enhancing Generalization with a Single Sample [45.111358665370524]
We propose a self-adaptive approach for semantic segmentation.
It fine-tunes the parameters of convolutional layers to the input image using consistency regularization.
Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time.
arXiv Detail & Related papers (2022-08-10T12:29:01Z)
- Compositional Generalization and Decomposition in Neural Program Synthesis [59.356261137313275]
In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize.
We first characterize several different axes along which program synthesis methods would ideally generalize.
We introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets.
arXiv Detail & Related papers (2022-04-07T22:16:05Z)
- Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets [51.095144091781734]
We propose a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs.
We show that our algorithm performs competitively with or better than prior algorithms on both compositional template splits and traditional IID splits.
In general, we find that diverse training sets lead to better generalization than random training sets of the same size in 9 out of 10 dataset-split pairs.
arXiv Detail & Related papers (2022-03-16T07:41:27Z)
- Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus.
We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
- Sequence-Level Mixed Sample Data Augmentation [119.94667752029143]
This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems.
Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set.
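(A minimal, hypothetical sketch of this kind of soft sequence mixing appears after this list.)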
arXiv Detail & Related papers (2020-11-18T02:18:04Z)
- Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both? [27.590858384414567]
We ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization?
We propose new train and test splits of non-synthetic datasets to better assess this capability.
We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model.
arXiv Detail & Related papers (2020-10-24T00:38:27Z)
- Learning to Recombine and Resample Data for Compositional Generalization [35.868789086531685]
We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure.
R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation.
arXiv Detail & Related papers (2020-10-08T00:36:33Z)
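As referenced in the SeqMix entry above, the following is a minimal, hypothetical PyTorch sketch of sequence-level soft mixing in the spirit of that summary. The function name, the Beta(alpha, alpha) mixing prior, and the assumption that both examples are already embedded and padded to the same length are illustrative choices, not details taken from the paper.

    # Illustrative sketch only: generic mixup applied at the sequence level,
    # not the paper's exact SeqMix formulation.
    import torch

    def soft_seqmix(src_emb_a, src_emb_b, tgt_onehot_a, tgt_onehot_b, alpha=0.2):
        """Convexly combine two (seq_len, d_model) source embeddings and their
        (seq_len, vocab) one-hot target distributions with a shared lambda."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        mixed_src = lam * src_emb_a + (1.0 - lam) * src_emb_b
        mixed_tgt = lam * tgt_onehot_a + (1.0 - lam) * tgt_onehot_b
        return mixed_src, mixed_tgt, lam

    if __name__ == "__main__":
        seq_len, d_model, vocab = 8, 16, 100
        emb_a, emb_b = torch.randn(seq_len, d_model), torch.randn(seq_len, d_model)
        tgt_a = torch.nn.functional.one_hot(torch.randint(vocab, (seq_len,)), vocab).float()
        tgt_b = torch.nn.functional.one_hot(torch.randint(vocab, (seq_len,)), vocab).float()
        mixed_src, mixed_tgt, lam = soft_seqmix(emb_a, emb_b, tgt_a, tgt_b)
        print(mixed_src.shape, mixed_tgt.shape, round(lam, 3))

The mixed pair can then be fed to a standard sequence-to-sequence model in place of, or alongside, the original examples as an additional form of synthetic training data.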
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (or of any information on it) and is not responsible for any consequences arising from its use.