Syntax-driven Data Augmentation for Named Entity Recognition
- URL: http://arxiv.org/abs/2208.06957v1
- Date: Mon, 15 Aug 2022 01:24:55 GMT
- Title: Syntax-driven Data Augmentation for Named Entity Recognition
- Authors: Arie Pratama Sutiono, Gus Hahn-Powell
- Abstract summary: In low resource settings, data augmentation strategies are commonly leveraged to improve performance.
We compare simple masked language model replacement and an augmentation method using constituency tree mutations to improve named entity recognition.
- Score: 3.0603554929274908
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In low resource settings, data augmentation strategies are commonly leveraged
to improve performance. Numerous approaches have attempted document-level
augmentation (e.g., text classification), but few studies have explored
token-level augmentation. Performed naively, data augmentation can produce
semantically incongruent and ungrammatical examples. In this work, we compare
simple masked language model replacement and an augmentation method using
constituency tree mutations to improve the performance of named entity
recognition in low-resource settings with the aim of preserving linguistic
cohesion of the augmented sentences.
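The token-level setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment_non_entity_tokens`, the `propose` callback, and the toy substitution table are hypothetical names introduced here. The callback stands in for a masked language model, and the only idea shown is that tokens labeled "O" may be replaced while entity tokens and all labels stay fixed.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def augment_non_entity_tokens(
    tokens: Sequence[str],
    labels: Sequence[str],
    propose: Callable[[Sequence[str], int], str],
    rate: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Replace some non-entity tokens and return (new_tokens, labels).

    `propose(tokens, i)` is a hypothetical stand-in for a masked language
    model: it returns a replacement for the token at position i.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = rng or random.Random()
    new_tokens: List[str] = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        # Only tokens outside any entity span ("O") are candidates, so the
        # BIO label sequence remains valid for the augmented sentence.
        if label == "O" and rng.random() < rate:
            new_tokens.append(propose(tokens, i))
        else:
            new_tokens.append(token)
    return new_tokens, list(labels)


# Toy substitution table (an assumption made for this example only).
SUBSTITUTES = {"visited": "toured", "yesterday": "recently"}


def toy_propose(tokens: Sequence[str], i: int) -> str:
    # Fall back to the original token when no substitute is known.
    return SUBSTITUTES.get(tokens[i], tokens[i])


tokens = ["Alice", "visited", "Paris", "yesterday"]
labels = ["B-PER", "O", "B-LOC", "O"]
print(augment_non_entity_tokens(tokens, labels, toy_propose, rate=1.0))
# → (['Alice', 'toured', 'Paris', 'recently'], ['B-PER', 'O', 'B-LOC', 'O'])
```

With `rate=1.0` every "O" token is passed to `propose`. A real setup would swap `toy_propose` for a model-backed function and would likely need a filter for replacements that accidentally introduce new entities.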
Related papers
- GASE: Generatively Augmented Sentence Encoding [0.0]
We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time.
Generatively Augmented Sentence Encoding (GASE) uses diverse synthetic variants of input texts generated by paraphrasing, summarising or extracting keywords.
We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance.
arXiv Detail & Related papers (2024-11-07T17:53:47Z)
- Distributional Data Augmentation Methods for Low Resource Language [0.9208007322096533]
Easy data augmentation (EDA) augments the training data by injecting and replacing synonyms and randomly permuting sentences.
One major obstacle with EDA is the need for versatile and complete synonym dictionaries, which cannot be easily found in low-resource languages.
We propose two extensions, easy distributional data augmentation (EDDA) and type specific similar word replacement (TSSR), which use semantic word context information and part-of-speech tags for word replacement and augmentation.
arXiv Detail & Related papers (2023-09-09T19:01:59Z)
- GDA: Generative Data Augmentation Techniques for Relation Extraction Tasks [81.51314139202152]
We propose a dedicated augmentation technique for relational texts, named GDA, which uses two complementary modules to preserve both semantic consistency and syntax structures.
Experimental results on three datasets under a low-resource setting showed that GDA could bring 2.0% F1 improvements compared with no augmentation technique.
arXiv Detail & Related papers (2023-05-26T06:21:01Z)
- Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime [35.95241861664597]
This paper proposes an Adversarial Word Dilution (AWD) method that can generate hard positive examples as text data augmentations.
Our idea of augmenting the text data is to dilute the embedding of strong positive words by weighted mixing with unknown-word embedding.
Empirical studies on three benchmark datasets show that AWD can generate more effective data augmentations and outperform the state-of-the-art text data augmentation methods.
arXiv Detail & Related papers (2023-05-16T08:46:11Z)
- TreeMix: Compositional Constituency-based Data Augmentation for Natural Language Understanding [56.794981024301094]
We propose a compositional data augmentation approach for natural language understanding called TreeMix.
Specifically, TreeMix leverages constituency parse trees to decompose sentences into constituent sub-structures and the Mixup data augmentation technique to recombine them to generate new sentences.
Compared with previous approaches, TreeMix introduces greater diversity to the samples generated and encourages models to learn compositionality of NLP data.
arXiv Detail & Related papers (2022-05-12T15:25:12Z)
- SUBS: Subtree Substitution for Compositional Semantic Parsing [50.63574492655072]
We propose to use subtree substitution for compositional data augmentation, where we consider subtrees with similar semantic functions as exchangeable.
Experiments showed that such augmented data led to significantly better performance on SCAN and GeoQuery, and reached new SOTA on compositional split of GeoQuery.
arXiv Detail & Related papers (2022-05-03T14:47:35Z)
- ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text Classification [11.742065170002162]
We present data augmentation using Lexicalized Probabilistic context-free grammars (ALP).
Experiments on few-shot text classification tasks demonstrate that ALP enhances many state-of-the-art classification methods.
We argue empirically that the traditional splitting of training and validation sets is sub-optimal compared to our novel augmentation-based splitting strategies.
arXiv Detail & Related papers (2021-12-16T09:56:35Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Smart(Sampling)Augment: Optimal and Efficient Data Augmentation for Semantic Segmentation [68.8204255655161]
We provide the first study on semantic image segmentation and introduce two new approaches: SmartAugment and SmartSamplingAugment.
SmartAugment uses Bayesian Optimization to search over a rich space of augmentation strategies and achieves a new state-of-the-art performance in all semantic segmentation tasks we consider.
SmartSamplingAugment, a simple parameter-free approach with a fixed augmentation strategy, competes in performance with the existing resource-intensive approaches and outperforms cheap state-of-the-art data augmentation methods.
arXiv Detail & Related papers (2021-10-31T13:04:45Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction [80.38130122127882]
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE).
We use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets.
We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.