Substructure Substitution: Structured Data Augmentation for NLP
- URL: http://arxiv.org/abs/2101.00411v1
- Date: Sat, 2 Jan 2021 09:54:24 GMT
- Title: Substructure Substitution: Structured Data Augmentation for NLP
- Authors: Haoyue Shi, Karen Livescu, Kevin Gimpel
- Abstract summary: SUB2 generates new examples by substituting substructures with ones that have the same label.
For more general tasks, we present variations of SUB2 based on constituency parse trees.
In most cases, training with the dataset augmented by SUB2 achieves better performance than training with the original training set.
- Score: 55.69800855705232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a family of data augmentation methods, substructure substitution
(SUB2), for natural language processing (NLP) tasks. SUB2 generates new
examples by substituting substructures (e.g., subtrees or subsequences) with
ones that have the same label, which can be applied to many structured NLP tasks
such as part-of-speech tagging and parsing. For more general tasks (e.g., text
classification) which do not have explicitly annotated substructures, we
present variations of SUB2 based on constituency parse trees, introducing
structure-aware data augmentation methods to general NLP tasks. In most cases,
training with the dataset augmented by SUB2 achieves better performance than
training with the original training set. Further experiments show that SUB2 has
more consistent performance than other investigated augmentation methods,
across different tasks and sizes of the seed dataset.
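As a concrete illustration of the substitution idea, below is a minimal sketch of SUB2-style augmentation for POS tagging, assuming each example is a list of (word, tag) pairs. The single-word substructures, toy tag set, and helper name sub2_augment are illustrative simplifications of the paper's more general subtree/subsequence setting.

import random

def sub2_augment(dataset, n_new, rng=random):
    """Generate new tagged sentences by swapping same-label substructures.

    Here a substructure is a single (word, tag) pair; SUB2 more generally
    substitutes subtrees or subsequences that share a label.
    """
    # Index every word in the corpus by its tag (the substructure label).
    by_label = {}
    for sentence in dataset:
        for word, tag in sentence:
            by_label.setdefault(tag, []).append(word)

    augmented = []
    for _ in range(n_new):
        sentence = list(rng.choice(dataset))      # copy a seed example
        i = rng.randrange(len(sentence))
        _, tag = sentence[i]
        # Substitute the chosen substructure with one carrying the same label.
        sentence[i] = (rng.choice(by_label[tag]), tag)
        augmented.append(sentence)
    return augmented

seed = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
print(sub2_augment(seed, n_new=2))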
Related papers
- Evaluating representation learning on the protein structure universe [19.856785982914243]
ProteinWorkshop is a benchmark suite for representation learning on protein structures with Graph Neural Networks.
We consider large-scale pre-training and downstream tasks on both experimental and predicted structures.
We find that large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improves the performance of both rotation-invariant and equivariant GNNs.
arXiv Detail & Related papers (2024-06-19T21:48:34Z)
- Unsupervised Chunking with Hierarchical RNN [62.15060807493364]
This paper introduces an unsupervised approach to chunking, a syntactic task that involves grouping words in a non-hierarchical manner.
We present a two-layer Hierarchical Recurrent Neural Network (HRNN) designed to model word-to-chunk and chunk-to-sentence compositions.
Experiments on the CoNLL-2000 dataset show a notable improvement over existing unsupervised methods, improving phrase F1 by up to 6 percentage points.
arXiv Detail & Related papers (2023-09-10T02:55:12Z)
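A minimal sketch of the two-layer composition, assuming chunk spans are already given (the paper induces them without supervision); the GRU encoders and dimensions are illustrative stand-ins for the model described in the paper.

import torch
import torch.nn as nn

class HierarchicalRNN(nn.Module):
    """Two-layer hierarchical RNN: a word-level GRU composes words into chunk
    vectors, and a chunk-level GRU composes chunk vectors into a sentence vector.
    Chunk spans are taken as given here; the paper induces them without supervision."""

    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)   # word -> chunk
        self.chunk_rnn = nn.GRU(dim, dim, batch_first=True)  # chunk -> sentence

    def forward(self, token_ids, chunk_spans):
        # token_ids: (1, seq_len); chunk_spans: list of (start, end) over the sequence
        words = self.embed(token_ids)
        chunk_vecs = []
        for start, end in chunk_spans:
            _, h = self.word_rnn(words[:, start:end])   # encode the words of one chunk
            chunk_vecs.append(h[-1])                    # final hidden state = chunk vector
        chunks = torch.stack(chunk_vecs, dim=1)         # (1, n_chunks, dim)
        _, sentence = self.chunk_rnn(chunks)            # compose chunks into a sentence
        return chunks, sentence[-1]

model = HierarchicalRNN(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 6))
chunk_reprs, sentence_repr = model(tokens, [(0, 2), (2, 4), (4, 6)])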
- Learning to Paraphrase Sentences to Different Complexity Levels [3.0273878903284275]
Sentence simplification is an active research topic in NLP, but its adjacent tasks of sentence complexification and same-level paraphrasing are not.
To train models on all three tasks, we present two new unsupervised datasets.
arXiv Detail & Related papers (2023-08-04T09:43:37Z)
- SUBS: Subtree Substitution for Compositional Semantic Parsing [50.63574492655072]
We propose to use subtree substitution for compositional data augmentation, where we consider subtrees with similar semantic functions as exchangeable.
Experiments showed that such augmented data led to significantly better performance on SCAN and GeoQuery, and reached a new SOTA on the compositional split of GeoQuery.
arXiv Detail & Related papers (2022-05-03T14:47:35Z)
- Structurally Diverse Sampling Reduces Spurious Correlations in Semantic Parsing Datasets [51.095144091781734]
We propose a novel algorithm for sampling a structurally diverse set of instances from a labeled instance pool with structured outputs.
We show that our algorithm performs competitively with or better than prior algorithms in not only compositional template splits but also traditional IID splits.
In general, we find that diverse training sets lead to better generalization than random training sets of the same size in 9 out of 10 dataset-split pairs.
arXiv Detail & Related papers (2022-03-16T07:41:27Z)
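A minimal greedy sketch of the idea described above, assuming a Jaccard-style distance over bags of output symbols as the structural similarity; the template extraction and the farthest-point selection heuristic below are illustrative stand-ins for the paper's actual algorithm.

def template(logical_form):
    """Abstract an output into a bag of structural tokens (here: its symbols)."""
    return frozenset(logical_form.replace("(", " ( ").replace(")", " ) ").split())

def distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def diverse_sample(pool, k):
    """Select k instances from a pool of (input, output) pairs, preferring
    outputs whose structure is far from everything already selected."""
    selected = [pool[0]]
    while len(selected) < k:
        chosen_templates = [template(y) for _, y in selected]
        # Farthest-point heuristic: maximize distance to the nearest selected item.
        best = max(
            (ex for ex in pool if ex not in selected),
            key=lambda ex: min(distance(template(ex[1]), t) for t in chosen_templates),
        )
        selected.append(best)
    return selected

train = diverse_sample(
    [("largest state", "argmax(state, size)"),
     ("rivers in texas", "river(loc(texas))"),
     ("biggest state", "argmax(state, size)")],
    k=2,
)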
- ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text Classification [11.742065170002162]
We present data augmentation using lexicalized probabilistic context-free grammars (ALP).
Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods.
We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies.
arXiv Detail & Related papers (2021-12-16T09:56:35Z)
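As a rough illustration, the sketch below generates label-preserving examples by sampling from a hand-written PCFG; the grammar, weights, and phrases are toy assumptions, not the lexicalized PCFGs that ALP induces from the few-shot training data.

import random

# Toy PCFG: each nonterminal maps to weighted right-hand sides.
PCFG = {
    "S": [(["NP", "VP"], 1.0)],
    "NP": [(["the movie"], 0.5), (["this film"], 0.3), (["the plot"], 0.2)],
    "VP": [(["was", "ADJ"], 0.7), (["felt", "ADJ"], 0.3)],
    "ADJ": [(["great"], 0.4), (["moving"], 0.3), (["clever"], 0.3)],
}

def sample(symbol):
    """Expand a symbol top-down, sampling productions by their weights."""
    if symbol not in PCFG:                      # terminal: already a word/phrase
        return symbol
    rhs_options, weights = zip(*PCFG[symbol])
    rhs = random.choices(rhs_options, weights=weights, k=1)[0]
    return " ".join(sample(s) for s in rhs)

# New examples inherit the label of the class whose grammar generated them
# ("positive" here is a placeholder label for this toy grammar).
augmented = [(sample("S"), "positive") for _ in range(3)]
print(augmented)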
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
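A minimal sketch of the twin ("Siamese") setup, with a simple embedding-and-pooling encoder standing in for BURT's BERT-based encoder; the training pairs (NLI, paraphrases) and the similarity objective are only indicated in comments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Twin encoder: the same network maps two texts of any granularity
    (word, phrase, or sentence) to fixed-size vectors. Training would pull
    together pairs labeled as similar, e.g. NLI entailment or paraphrase pairs."""

    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def encode(self, token_ids):
        # Mean-pool token embeddings into one fixed-size vector per input.
        return self.proj(self.embed(token_ids).mean(dim=1))

    def forward(self, ids_a, ids_b):
        # Shared weights on both branches -- the defining Siamese property.
        a, b = self.encode(ids_a), self.encode(ids_b)
        return F.cosine_similarity(a, b)

model = SiameseEncoder(vocab_size=1000)
sim = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1000, (2, 7)))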