Data Augmentation for Low-Resource Named Entity Recognition Using
Backtranslation
- URL: http://arxiv.org/abs/2108.11703v1
- Date: Thu, 26 Aug 2021 10:56:39 GMT
- Title: Data Augmentation for Low-Resource Named Entity Recognition Using
Backtranslation
- Authors: Usama Yaseen, Stefan Langer
- Abstract summary: We adapt backtranslation to generate high-quality and
linguistically diverse synthetic data for low-resource named entity
recognition. We perform experiments on two datasets from the materials
science (MaSciP) and biomedical (S800) domains.
- Score: 1.195496689595016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art natural language processing systems rely on
sizable training datasets to achieve high performance. The lack of such
datasets in specialized, low-resource domains leads to suboptimal
performance. In this work,
we adapt backtranslation to generate high-quality and linguistically diverse
synthetic data for low-resource named entity recognition. We perform
experiments on two datasets from the materials science (MaSciP) and
biomedical (S800) domains. The empirical results demonstrate the
effectiveness of our
proposed augmentation strategy, particularly in the low-resource scenario.
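The abstract does not spell out the augmentation pipeline, so the following is a minimal sketch of backtranslation-style augmentation for NER, assuming BIO-tagged input and a hypothetical `translate(text, src, tgt)` machine translation function (not the authors' released code). Entity mentions are shielded with placeholder tokens so their labels survive the round trip through a pivot language.

```python
# Minimal sketch of backtranslation-based augmentation for NER.
# `translate(text, src, tgt)` is a hypothetical MT function (any
# seq2seq translator would do); it is NOT the paper's released code.

from typing import Callable, List, Tuple

def backtranslate_ner(
    tokens: List[str],
    tags: List[str],                      # BIO tags aligned with tokens
    translate: Callable[[str, str, str], str],
    pivot: str = "de",
) -> Tuple[List[str], List[str]]:
    """Paraphrase the context of a labeled sentence via a round trip
    through a pivot language, shielding entity mentions with
    placeholders so their labels survive the translation."""
    mentions, masked = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            # Collect the full mention span and swap in a placeholder.
            j = i + 1
            while j < len(tokens) and tags[j].startswith("I-"):
                j += 1
            placeholder = f"ENT{len(mentions)}"
            mentions.append((placeholder, tokens[i:j], tags[i]))
            masked.append(placeholder)
            i = j
        else:
            masked.append(tokens[i])
            i += 1

    # Round trip: source -> pivot -> source paraphrases the context.
    pivot_text = translate(" ".join(masked), "en", pivot)
    round_trip = translate(pivot_text, pivot, "en")

    # Restore the original mentions and rebuild the tag sequence.
    lookup = {ph: (toks, b_tag) for ph, toks, b_tag in mentions}
    new_tokens, new_tags = [], []
    for tok in round_trip.split():
        if tok in lookup:
            ent_toks, b_tag = lookup[tok]
            new_tokens.extend(ent_toks)
            new_tags.extend([b_tag] + ["I-" + b_tag[2:]] * (len(ent_toks) - 1))
        else:
            new_tokens.append(tok)
            new_tags.append("O")
    return new_tokens, new_tags
```

With an identity `translate` the sentence comes back unchanged; with a real MT model the surrounding context is paraphrased while the mention labels are preserved.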
Related papers
- An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource Domains [0.9903198600681908]
We evaluate the effectiveness of two prominent text augmentation techniques, Mention Replacement and Contextual Word Replacement, on two widely-used NER models, Bi-LSTM+CRF and BERT.
We conduct experiments on four datasets from low-resource domains, and we explore the impact of various combinations of training subset sizes and number of augmented examples.
arXiv Detail & Related papers (2024-11-21T19:45:48Z)
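Mention Replacement, as named in the entry above, admits a compact sketch: each mention is swapped for another mention of the same entity type drawn from the rest of the training corpus. The function names and BIO-tag assumption here are illustrative, not taken from the paper.

```python
# Sketch of Mention Replacement over a BIO-tagged corpus.

import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, BIO tags)

def build_mention_pool(corpus: List[Sentence]) -> Dict[str, List[List[str]]]:
    """Collect every mention in the corpus, grouped by entity type."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        i = 0
        while i < len(tokens):
            if tags[i].startswith("B-"):
                j = i + 1
                while j < len(tokens) and tags[j].startswith("I-"):
                    j += 1
                pool[tags[i][2:]].append(tokens[i:j])
                i = j
            else:
                i += 1
    return pool

def mention_replace(sentence: Sentence, pool: Dict) -> Sentence:
    """Replace each mention with a random same-type mention."""
    tokens, tags = sentence
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j].startswith("I-"):
                j += 1
            repl = random.choice(pool[etype])
            new_tokens.extend(repl)
            new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags
```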
- LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition [67.96794382040547]
LLM-DA is a novel data augmentation technique based on large language models (LLMs) for the few-shot NER task.
Our approach involves employing 14 contextual rewriting strategies, designing entity replacements of the same type, and incorporating noise injection to enhance robustness.
arXiv Detail & Related papers (2024-02-22T14:19:56Z)
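As a rough illustration of the recipe summarized above (the paper uses 14 rewriting strategies; only one is shown here), context can be rewritten by an LLM while entities are bracketed for preservation, and token-level noise can be injected afterwards. `complete` is a hypothetical text-completion callable, not the authors' API.

```python
# Hedged sketch of the LLM-DA ingredients named above: contextual
# rewriting via an LLM prompt, plus simple noise injection.

import random
from typing import Callable, List

REWRITE_PROMPT = (
    "Rewrite the sentence below, keeping every entity inside [[...]] "
    "exactly as written:\n{sentence}"
)

def contextual_rewrite(sentence: str, entities: List[str],
                       complete: Callable[[str], str]) -> str:
    """Mark entities so the LLM preserves them, then rewrite context."""
    for ent in entities:
        sentence = sentence.replace(ent, f"[[{ent}]]")
    rewritten = complete(REWRITE_PROMPT.format(sentence=sentence))
    return rewritten.replace("[[", "").replace("]]", "")

def inject_noise(tokens: List[str], p: float = 0.05) -> List[str]:
    """Randomly drop tokens so the tagger sees noisier input."""
    return [t for t in tokens if random.random() > p] or tokens
```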
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- High-Resource Methodological Bias in Low-Resource Investigations [27.419604203739052]
We show that down-sampling from a high-resource language results in datasets with different properties than genuine low-resource datasets.
We conclude that naive down-sampling of datasets results in a biased view of how well these systems work in a low-resource scenario.
arXiv Detail & Related papers (2022-11-14T17:04:38Z)
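The bias argument above can be made concrete with a toy comparison: down-sample a high-resource corpus and check a simple property such as type-token ratio against a genuine low-resource dataset. This is an illustrative sketch, not the paper's experimental setup.

```python
# Toy illustration: naive down-sampling of a high-resource corpus need
# not reproduce the statistics of a genuinely low-resource dataset.

import random
from typing import List

def type_token_ratio(sentences: List[List[str]]) -> float:
    """Vocabulary size divided by token count: a crude lexical-diversity proxy."""
    tokens = [t for s in sentences for t in s]
    return len(set(tokens)) / len(tokens)

def down_sample(corpus: List[List[str]], n: int, seed: int = 0) -> List[List[str]]:
    """Draw n sentences uniformly at random from a larger corpus."""
    return random.Random(seed).sample(corpus, n)

# Comparing type_token_ratio(down_sample(high_resource, n)) against
# type_token_ratio(low_resource) makes any mismatch measurable.
```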
- Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition [17.892385961143173]
We propose a new method to transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes.
We design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data.
Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.
arXiv Detail & Related papers (2022-10-14T16:02:03Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
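A DAGA-style linearization, sketched here on BIO tags (illustrative, not the authors' code), inlines labels as ordinary tokens so that a language model trained on the result can be sampled to generate new labeled sentences.

```python
# Sketch of DAGA-style linearization: BIO labels are inlined as tokens
# so an ordinary language model can be trained on labeled sentences and
# later sampled to generate new ones.

from typing import List, Tuple

def linearize(tokens: List[str], tags: List[str]) -> List[str]:
    """['John','lives'] + ['B-PER','O'] -> ['B-PER','John','lives']"""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)   # label token precedes the word it labels
        out.append(tok)
    return out

def delinearize(seq: List[str]) -> Tuple[List[str], List[str]]:
    """Recover (tokens, tags) from a generated linearized sequence."""
    tokens, tags, pending = [], [], "O"
    for item in seq:
        if item.startswith(("B-", "I-")):
            pending = item
        else:
            tokens.append(item)
            tags.append(pending)
            pending = "O"
    return tokens, tags
```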
- Exploiting News Article Structure for Automatic Corpus Generation of Entailment Datasets [1.859931123372708]
First, we propose a methodology for automatically producing benchmark datasets for low-resource languages using published news articles.
Second, we produce new pretrained transformers based on the ELECTRA technique to further alleviate the resource scarcity in Filipino.
Third, we perform analyses on transfer learning techniques to shed light on their true performance when operating in low-data domains.
arXiv Detail & Related papers (2020-10-22T10:09:10Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation [116.14378571769045]
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
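A hedged sketch of the dynamic selection idea above: at each back-translation round, synthetic pairs are ranked by a quality score and only a growing fraction is admitted. `score` is a hypothetical quality estimator (e.g. model confidence); the actual curriculum in the paper may differ.

```python
# Sketch of dynamic data selection for iterative back-translation:
# keep only the best-scoring synthetic pairs, admitting a larger share
# as training rounds progress.

from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (synthetic source, target)

def select_pairs(pairs: List[Pair], score: Callable[[Pair], float],
                 round_idx: int, total_rounds: int) -> List[Pair]:
    """Keep the top-k pairs, where k grows linearly over rounds."""
    keep_frac = (round_idx + 1) / total_rounds
    ranked = sorted(pairs, key=score, reverse=True)
    return ranked[: max(1, int(keep_frac * len(ranked)))]
```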