TreeMix: Compositional Constituency-based Data Augmentation for Natural
Language Understanding
- URL: http://arxiv.org/abs/2205.06153v1
- Date: Thu, 12 May 2022 15:25:12 GMT
- Title: TreeMix: Compositional Constituency-based Data Augmentation for Natural
Language Understanding
- Authors: Le Zhang, Zichao Yang, Diyi Yang
- Abstract summary: We propose a compositional data augmentation approach for natural language understanding called TreeMix.
Specifically, TreeMix leverages constituency parse trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them into new sentences.
Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn compositionality of NLP data.
- Score: 56.794981024301094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation is an effective approach to tackle over-fitting. Many
previous works have proposed different data augmentation strategies for NLP,
such as noise injection, word replacement, and back-translation. Though
effective, they miss one important characteristic of
language: compositionality, whereby the meaning of a complex expression is built from its
sub-parts. Motivated by this, we propose a compositional data augmentation
approach for natural language understanding called TreeMix. Specifically,
TreeMix leverages constituency parse trees to decompose sentences into
constituent sub-structures and the Mixup data augmentation technique to
recombine them to generate new sentences. Compared with previous approaches,
TreeMix introduces greater diversity to the samples generated and encourages
models to learn compositionality of NLP data. Extensive experiments on text
classification and SCAN demonstrate that TreeMix outperforms current
state-of-the-art data augmentation methods.
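The decompose-and-recombine idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-tuple tree format, the function names (`leaves`, `constituent_spans`, `tree_mix`), the uniform random choice of constituents, and the length-proportional label mixing are all assumptions made for illustration, and the trees are written by hand rather than produced by a parser.

```python
import random

# Assumed tree format: (label, [children]) for phrases, (label, "token") for
# leaves. A real pipeline would obtain such trees from a constituency parser.

def leaves(tree):
    """Return the tokens of a tree in left-to-right order."""
    _, body = tree
    if isinstance(body, str):
        return [body]
    return [tok for child in body for tok in leaves(child)]

def constituent_spans(tree, start=0):
    """Return ([(start, end), ...], end) for every phrase-level subtree."""
    _, body = tree
    if isinstance(body, str):
        return [], start + 1
    spans, pos = [], start
    for child in body:
        child_spans, pos = constituent_spans(child, pos)
        spans.extend(child_spans)
    spans.append((start, pos))
    return spans, pos

def tree_mix(tree_a, label_a, tree_b, label_b, num_classes, rng):
    """Replace one constituent of sentence A with one constituent of B.

    The soft label weights each class by the fraction of tokens it
    contributed (an assumed Mixup-style rule, not taken from the abstract).
    """
    toks_a, toks_b = leaves(tree_a), leaves(tree_b)
    # Exclude the root span so that only proper sub-structures are swapped.
    spans_a = [s for s in constituent_spans(tree_a)[0] if s != (0, len(toks_a))]
    spans_b = [s for s in constituent_spans(tree_b)[0] if s != (0, len(toks_b))]
    sa, ea = rng.choice(spans_a)
    sb, eb = rng.choice(spans_b)
    mixed = toks_a[:sa] + toks_b[sb:eb] + toks_a[ea:]
    weight_b = (eb - sb) / len(mixed)
    soft = [0.0] * num_classes
    soft[label_a] += 1.0 - weight_b
    soft[label_b] += weight_b
    return mixed, soft

# Hand-written example trees (hypothetical data): label 1 = positive, 0 = negative.
tree_a = ("S", [("NP", [("DT", "the"), ("NN", "movie")]),
                ("VP", [("VBD", "was"), ("ADJP", [("JJ", "great")])])])
tree_b = ("S", [("NP", [("DT", "the"), ("NN", "plot")]),
                ("VP", [("VBD", "felt"),
                        ("ADJP", [("RB", "really"), ("JJ", "dull")])])])

if __name__ == "__main__":
    tokens, soft_label = tree_mix(tree_a, 1, tree_b, 0, 2, random.Random(0))
    print(" ".join(tokens), soft_label)
```

Because whole constituents are swapped rather than arbitrary token windows, each recombined sentence is assembled from complete phrases of the two source sentences, and the soft label records how much of the result came from each source.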
Related papers
- DoubleMix: Simple Interpolation-Based Data Augmentation for Text
Classification [56.817386699291305]
This paper proposes a simple yet effective data augmentation approach termed DoubleMix.
DoubleMix first generates several perturbed samples for each training example.
It then uses the perturbed data and original data to carry out a two-step interpolation in the hidden space of neural models.
arXiv Detail & Related papers (2022-09-12T15:01:04Z)
- SUBS: Subtree Substitution for Compositional Semantic Parsing [50.63574492655072]
We propose to use subtree substitution for compositional data augmentation, where we consider subtrees with similar semantic functions as exchangeable.
Experiments showed that such augmented data led to significantly better performance on SCAN and GeoQuery, and reached new SOTA on compositional split of GeoQuery.
arXiv Detail & Related papers (2022-05-03T14:47:35Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of the OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text
Classification [11.742065170002162]
We present data Augmentation using Lexicalized Probabilistic context-free grammars (ALP).
Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods.
We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies.
arXiv Detail & Related papers (2021-12-16T09:56:35Z)
- Sequence-Level Mixed Sample Data Augmentation [119.94667752029143]
This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems.
Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set.
arXiv Detail & Related papers (2020-11-18T02:18:04Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on the WMT14 English-to-German dataset and the IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
- Stochastic Natural Language Generation Using Dependency Information [0.7995360025953929]
This article presents a corpus-based model for generating natural language text.
Our model encodes dependency relations from training data through a feature set, then produces a new dependency tree for a given meaning representation.
We show that our model produces high-quality utterances in terms of informativeness, naturalness, and overall quality.
arXiv Detail & Related papers (2020-01-12T09:40:11Z)