Finding needles in a haystack: Sampling Structurally-diverse Training
Sets from Synthetic Data for Compositional Generalization
- URL: http://arxiv.org/abs/2109.02575v1
- Date: Mon, 6 Sep 2021 16:20:47 GMT
- Title: Finding needles in a haystack: Sampling Structurally-diverse Training
Sets from Synthetic Data for Compositional Generalization
- Authors: Inbar Oren, Jonathan Herzig and Jonathan Berant
- Abstract summary: We investigate automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing.
We select a subset of synthetic examples that are structurally-diverse and use them to improve compositional generalization.
- We evaluate our approach on a new split of the schema2QA dataset, and show that it leads to dramatic improvements in compositional generalization as well as moderate improvements in the traditional i.i.d. setup.
- Score: 33.30539396439008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern semantic parsers suffer from two principal limitations. First,
training requires expensive collection of utterance-program pairs. Second,
semantic parsers fail to generalize at test time to new compositions/structures
that have not been observed during training. Recent research has shown that
automatic generation of synthetic utterance-program pairs can alleviate the
first problem, but its potential for the second has thus far been
under-explored. In this work, we investigate automatic generation of synthetic
utterance-program pairs for improving compositional generalization in semantic
parsing. Given a small training set of annotated examples and an "infinite"
pool of synthetic examples, we select a subset of synthetic examples that are
structurally-diverse and use them to improve compositional generalization. We
evaluate our approach on a new split of the schema2QA dataset, and show that it
leads to dramatic improvements in compositional generalization as well as
moderate improvements in the traditional i.i.d. setup. Moreover,
structurally-diverse sampling achieves these improvements with as few as 5K
examples, compared to 1M examples when sampling uniformly at random -- a 200x
improvement in data efficiency.
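To make the sampling idea above concrete, here is a minimal, hypothetical Python sketch under two stated assumptions: synthetic programs are plain strings, and anonymizing literal values yields a usable structural "template". The template() regex heuristic, the greedy coverage loop, and the toy examples below are illustrative choices, not the authors' actual algorithm or synthetic grammar.

    # Illustrative sketch only: not the paper's exact sampling procedure.
    import random
    import re

    def template(program: str) -> str:
        # Assumed heuristic: strip literal values so only structure remains.
        return re.sub(r'"[^"]*"|\b\d+\b', "<VAL>", program)

    def diverse_sample(pool, budget, seed=0):
        """Greedily keep one example per unseen structural template,
        then top up with random leftovers if templates run out."""
        pool = list(pool)
        random.Random(seed).shuffle(pool)
        selected, seen = [], set()
        for utterance, program in pool:
            if len(selected) == budget:
                break
            t = template(program)
            if t not in seen:
                seen.add(t)
                selected.append((utterance, program))
        leftovers = [ex for ex in pool if ex not in selected]
        selected.extend(leftovers[: budget - len(selected)])
        return selected

    if __name__ == "__main__":
        synthetic_pool = [
            ("find restaurants rated above 4", "filter(restaurant, rating > 4)"),
            ("find restaurants rated above 3", "filter(restaurant, rating > 3)"),
            ("how many restaurants are open now",
             'count(filter(restaurant, open = "now"))'),
        ]
        # With budget=2, the two rating queries share a template, so the
        # count query is guaranteed a slot.
        print(diverse_sample(synthetic_pool, budget=2))

A uniform random sample of the same size could easily draw both rating queries and miss the count structure entirely, which is the failure mode the structurally-diverse strategy is meant to avoid.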
Related papers
- ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis [54.18659323181771]
We characterize several different forms of compositional generalization that are desirable in program synthesis.
We propose ExeDec, a novel decomposition-based strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step.
arXiv Detail & Related papers (2023-07-26T01:07:52Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Semantic Self-adaptation: Enhancing Generalization with a Single Sample [45.111358665370524]
We propose a self-adaptive approach for semantic segmentation.
It fine-tunes the parameters of convolutional layers to the input image using consistency regularization.
Our empirical study suggests that self-adaptation may complement the established practice of model regularization at training time.
arXiv Detail & Related papers (2022-08-10T12:29:01Z)
- Compositional Generalization and Decomposition in Neural Program Synthesis [59.356261137313275]
In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize.
We first characterize several different axes along which program synthesis methods would ideally generalize.
We introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets.
arXiv Detail & Related papers (2022-04-07T22:16:05Z)
- Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets [51.095144091781734]
We propose a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs.
We show that our algorithm performs competitively with or better than prior algorithms on both compositional template splits and traditional IID splits.
In general, we find that diverse training sets lead to better generalization than random training sets of the same size in 9 out of 10 dataset-split pairs.
arXiv Detail & Related papers (2022-03-16T07:41:27Z)
- Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus.
We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
- Sequence-Level Mixed Sample Data Augmentation [119.94667752029143]
This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems.
Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set.
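(A minimal, hypothetical sketch of this kind of soft sequence mixing appears after this list.)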
arXiv Detail & Related papers (2020-11-18T02:18:04Z)
- Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both? [27.590858384414567]
We ask: can we develop a semantic parsing approach that handles both natural language variation and compositional generalization?
We propose new train and test splits of non-synthetic datasets to better assess this capability.
We also propose NQG-T5, a hybrid model that combines a high-precision grammar-based approach with a pre-trained sequence-to-sequence model.
arXiv Detail & Related papers (2020-10-24T00:38:27Z)
- Learning to Recombine and Resample Data for Compositional Generalization [35.868789086531685]
We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure.
R&R has two components: recombination of original training examples via a prototype-based generative model and resampling of generated examples to encourage extrapolation.
arXiv Detail & Related papers (2020-10-08T00:36:33Z)
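As referenced in the SeqMix entry above, the following is a minimal, hypothetical PyTorch sketch of sequence-level soft mixing in the spirit of that summary. The function name, the Beta(alpha, alpha) mixing prior, and the assumption that both examples are already embedded and padded to the same length are illustrative choices, not details taken from the paper.

    # Illustrative sketch only: generic mixup applied at the sequence level,
    # not the paper's exact SeqMix formulation.
    import torch

    def soft_seqmix(src_emb_a, src_emb_b, tgt_onehot_a, tgt_onehot_b, alpha=0.2):
        """Convexly combine two (seq_len, d_model) source embeddings and their
        (seq_len, vocab) one-hot target distributions with a shared lambda."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        mixed_src = lam * src_emb_a + (1.0 - lam) * src_emb_b
        mixed_tgt = lam * tgt_onehot_a + (1.0 - lam) * tgt_onehot_b
        return mixed_src, mixed_tgt, lam

    if __name__ == "__main__":
        seq_len, d_model, vocab = 8, 16, 100
        emb_a, emb_b = torch.randn(seq_len, d_model), torch.randn(seq_len, d_model)
        tgt_a = torch.nn.functional.one_hot(torch.randint(vocab, (seq_len,)), vocab).float()
        tgt_b = torch.nn.functional.one_hot(torch.randint(vocab, (seq_len,)), vocab).float()
        mixed_src, mixed_tgt, lam = soft_seqmix(emb_a, emb_b, tgt_a, tgt_b)
        print(mixed_src.shape, mixed_tgt.shape, round(lam, 3))

The mixed pair can then be fed to a standard sequence-to-sequence model in place of, or alongside, the original examples as an additional form of synthetic training data.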
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (or of any information on it) and is not responsible for any consequences arising from its use.