Data Augmentation for Neural Machine Translation using Generative
Language Model
- URL: http://arxiv.org/abs/2307.16833v2
- Date: Mon, 13 Nov 2023 13:17:03 GMT
- Title: Data Augmentation for Neural Machine Translation using Generative
Language Model
- Authors: Seokjin Oh, Su Ah Lee and Woohwan Jung
- Abstract summary: The scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation.
Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new data.
We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT.
- Score: 1.5500145658862499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid advances in model architectures, the scarcity of large
parallel corpora remains the main bottleneck in Neural Machine Translation.
Data augmentation is a technique that enhances the performance of data-hungry
models by generating synthetic data instead of collecting new data. We explore
prompt-based data augmentation approaches that leverage large-scale language
models such as ChatGPT. To create a synthetic parallel corpus, we compare three
methods that use different prompts. We employ two assessment metrics to measure
the diversity of the generated synthetic data. This approach incurs no
additional model training cost, unlike other augmentation methods such as
back-translation. The proposed method improves the unaugmented baseline by
0.68 BLEU.
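As a rough illustration of the idea, the sketch below shows how a single synthetic parallel pair could be requested from a chat LLM with one prompt. The prompt wording, model name, and output-parsing helper are assumptions for illustration, not the authors' exact setup.
```python
# Minimal sketch of prompt-based parallel-corpus augmentation (illustrative only).
# Assumes the openai>=1.0 Python client and an OPENAI_API_KEY in the environment;
# the prompt wording and model name are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def generate_synthetic_pair(seed_sentence: str, src_lang: str = "German",
                            tgt_lang: str = "English") -> tuple[str, str]:
    """Ask the LLM for a new source sentence similar to the seed, plus its translation."""
    prompt = (
        f"Write one new {src_lang} sentence on a similar topic to the sentence below, "
        f"then translate it into {tgt_lang}.\n"
        f"Sentence: {seed_sentence}\n"
        f"Answer in the format:\n{src_lang}: <sentence>\n{tgt_lang}: <translation>"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the paper refers generically to ChatGPT
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,        # higher temperature encourages more diverse synthetic data
    )
    text = resp.choices[0].message.content
    src, tgt = "", ""
    for line in text.splitlines():
        if line.startswith(f"{src_lang}:"):
            src = line.split(":", 1)[1].strip()
        elif line.startswith(f"{tgt_lang}:"):
            tgt = line.split(":", 1)[1].strip()
    return src, tgt

if __name__ == "__main__":
    # Usage: grow a small parallel corpus without training any auxiliary model.
    seed = "Der Zug nach Berlin hat heute zwanzig Minuten Verspätung."
    print(generate_synthetic_pair(seed))
```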
Related papers
- GASE: Generatively Augmented Sentence Encoding [0.0]
We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time.
Generatively Augmented Sentence Encoding (GASE) uses diverse synthetic variants of input texts generated by paraphrasing, summarising or extracting keywords.
We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance.
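A minimal sketch of what inference-time generative augmentation of embeddings could look like is given below; the choice of embedding model, the stubbed-in variants, and mean-pooling as the aggregation step are assumptions, not details taken from the paper.
```python
# Sketch of inference-time generative augmentation for sentence embeddings
# (in the spirit of GASE). Variant generation is stubbed out; mean-pooling the
# variant embeddings with the original is an assumed aggregation strategy.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf embedding model

def augmented_embedding(text: str, variants: list[str]) -> np.ndarray:
    """Embed the input together with its synthetic variants and average them."""
    embeddings = model.encode([text] + variants, normalize_embeddings=True)
    return embeddings.mean(axis=0)

# `variants` would come from an LLM prompted to paraphrase, summarise,
# or extract keywords from `text`, as described in the paper.
vec = augmented_embedding(
    "Parallel corpora are scarce for many language pairs.",
    variants=["Bilingual training data is hard to find for most languages.",
              "scarce parallel corpora, low-resource languages"],
)
```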
arXiv Detail & Related papers (2024-11-07T17:53:47Z)
- Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis [21.210982054134686]
The joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field.
Existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material.
arXiv Detail & Related papers (2024-04-30T15:22:19Z)
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
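As a hedged toy example of what "purely synthetic parallel data" can mean, the sketch below pairs random nonsense source sentences with translations derived from a fixed token-level mapping; this simplification is an assumption and not the paper's exact data construction.
```python
# Toy illustration of one way to build purely synthetic "parallel" data for
# pre-training: random token sequences paired via a deterministic token-level
# mapping. This is an assumed simplification, not the paper's exact construction.
import random

random.seed(0)
SRC_VOCAB = [f"src{i}" for i in range(1000)]
TGT_VOCAB = [f"tgt{i}" for i in range(1000)]
MAPPING = dict(zip(SRC_VOCAB, random.sample(TGT_VOCAB, len(TGT_VOCAB))))

def synthetic_pair(length: int = 10) -> tuple[str, str]:
    """Sample a nonsense source sentence and derive its 'translation' via the mapping."""
    src_tokens = random.choices(SRC_VOCAB, k=length)
    tgt_tokens = [MAPPING[t] for t in src_tokens]
    return " ".join(src_tokens), " ".join(tgt_tokens)

corpus = [synthetic_pair() for _ in range(100_000)]  # synthetic pre-training corpus
```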
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
- Continual Knowledge Distillation for Neural Machine Translation [74.03622486218597]
Many parallel corpora are not publicly accessible due to data copyright, data privacy, and competitive differentiation concerns.
We propose a method called continual knowledge distillation to take advantage of existing translation models to improve one model of interest.
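The sketch below shows a generic token-level distillation objective of the kind such a method builds on; the interpolation weight and temperature are assumptions, and the continual schedule over successive teacher models is not shown.
```python
# Generic token-level knowledge-distillation loss for NMT. The continual scheme
# over a sequence of teacher models is omitted; alpha and temperature are assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,   # (batch, seq, vocab)
            teacher_logits: torch.Tensor,   # (batch, seq, vocab), detached teacher outputs
            gold_ids: torch.Tensor,         # (batch, seq)
            alpha: float = 0.5,
            temperature: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids)  # supervised signal
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )  # pull the student toward the teacher's token distribution
    return alpha * ce + (1.0 - alpha) * kl
```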
arXiv Detail & Related papers (2022-12-18T14:41:13Z)
- Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes [0.0]
We propose a neural data augmentation method that combines a Variational Autoencoder with an encoder-decoder Transformer model.
While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language together with its class condition.
Our model improves the performance of current models over other data augmentation techniques while requiring little computational power.
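A minimal sketch of the variational objective such a class-conditional VAE augmenter would optimise is given below; how the class label is injected (e.g. as a prepended embedding) and the beta weight are assumptions, and the Transformer itself is omitted.
```python
# Sketch of the variational objective for a VAE-style sentence augmenter:
# token reconstruction plus a KL term on the latent. The class condition is assumed
# to be fed to the decoder elsewhere; model code is not shown.
import torch
import torch.nn.functional as F

def cvae_loss(recon_logits: torch.Tensor,  # (batch, seq, vocab) decoder outputs
              target_ids: torch.Tensor,    # (batch, seq) input sentence to reconstruct
              mu: torch.Tensor,            # (batch, latent) posterior mean from encoder
              logvar: torch.Tensor,        # (batch, latent) posterior log-variance
              beta: float = 1.0) -> torch.Tensor:
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_ids)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps so gradients flow through the sampling step."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```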
arXiv Detail & Related papers (2022-05-19T08:42:33Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
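The sketch below is one hedged reading of how samples from such an adjacency region could be drawn in embedding space, by interpolating the source and target sentence representations and adding bounded noise; this is an assumed simplification, not CsaNMT's actual sampling procedure.
```python
# Hedged sketch of sampling augmented representations from an "adjacency region"
# around a training instance; interpolating the two sentence vectors and adding
# small noise is an assumed simplification of CsaNMT.
import torch

def sample_adjacent(src_vec: torch.Tensor,   # (d,) source sentence representation
                    tgt_vec: torch.Tensor,   # (d,) target sentence representation
                    epsilon: float = 0.1,
                    num_samples: int = 4) -> torch.Tensor:
    """Return num_samples vectors lying near the segment between the two representations."""
    lam = torch.rand(num_samples, 1)              # random interpolation coefficients
    base = lam * src_vec + (1.0 - lam) * tgt_vec  # points on the segment
    noise = epsilon * torch.randn_like(base)      # small perturbation around the segment
    return base + noise

# Each sampled vector would serve as an extra continuous training signal for the decoder.
```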
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT significantly improves state-of-the-art neural machine translation performance across 15 translation tasks on 8 language pairs.
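On the data side, the early-stage corpus for such bidirectional training could be built as in the sketch below; the direction tags on the source are an assumption about one possible implementation, not necessarily the paper's.
```python
# Sketch of the data side of bidirectional training: for the early stage the parallel
# corpus is doubled with reversed-direction pairs, then normal one-direction tuning
# follows. The direction tags on the source side are an assumed implementation detail.
def build_bidirectional_corpus(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """pairs are (source, target); return the union of both translation directions."""
    forward = [(f"<2tgt> {src}", tgt) for src, tgt in pairs]
    backward = [(f"<2src> {tgt}", src) for src, tgt in pairs]
    return forward + backward

early_stage_data = build_bidirectional_corpus([("Guten Morgen.", "Good morning.")])
# After the early stage, training continues on the forward pairs only.
```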
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Compared with several state-of-the-art non-autoregressive models, our model achieves significantly faster inference while preserving translation quality.
arXiv Detail & Related papers (2021-01-22T04:12:17Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
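A simplified, deterministic sketch of the Dynamic Blocking idea follows: if the token just generated also appears in the source, the token that follows it in the source is masked at the next decoding step, discouraging verbatim copying. The real algorithm works on surface forms and applies blocking probabilistically; this toy version and its token-level dictionaries are assumptions.
```python
# Simplified, deterministic toy of Dynamic Blocking for paraphrase decoding.
# The published algorithm blocks probabilistically over surface forms; here we
# always block the source successor of the token just generated.
import math

def blocked_next_tokens(source_tokens: list[str], last_generated: str) -> set[str]:
    """Collect the source-side successors of the token the decoder just emitted."""
    return {source_tokens[i + 1]
            for i in range(len(source_tokens) - 1)
            if source_tokens[i] == last_generated}

def apply_blocking(logits: dict[str, float], blocked: set[str]) -> dict[str, float]:
    """Mask blocked tokens before sampling/argmax at this decoding step."""
    return {tok: (-math.inf if tok in blocked else score) for tok, score in logits.items()}
```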
arXiv Detail & Related papers (2020-10-24T11:55:28Z)