Data Augmentation for Neural Machine Translation using Generative
Language Model
- URL: http://arxiv.org/abs/2307.16833v2
- Date: Mon, 13 Nov 2023 13:17:03 GMT
- Title: Data Augmentation for Neural Machine Translation using Generative
Language Model
- Authors: Seokjin Oh, Su Ah Lee and Woohwan Jung
- Abstract summary: The scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation.
Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data rather than collecting new data.
We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT.
- Score: 1.5500145658862499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid growth in model architecture, the scarcity of large
parallel corpora remains the main bottleneck in Neural Machine Translation.
Data augmentation is a technique that enhances the performance of data-hungry
models by generating synthetic data instead of collecting new ones. We explore
prompt-based data augmentation approaches that leverage large-scale language
models such as ChatGPT. To create a synthetic parallel corpus, we compare three
methods using different prompts. We employ two assessment metrics to measure
the diversity of the generated synthetic data. This approach requires no
further model-training cost, which is unavoidable in other augmentation methods
such as back-translation. The proposed method improves over the unaugmented
baseline by 0.68 BLEU points.
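The abstract does not disclose the three prompt formats, the exact ChatGPT model, or the two diversity metrics, so the following Python sketch is purely illustrative: it generates one synthetic source-target pair from a seed sentence using one plausible prompt, and scores lexical diversity with distinct-n, a common choice. The prompt wording, model name, language pair, and metric are all assumptions, not the paper's actual setup.

```python
# Minimal sketch of prompt-based parallel-data augmentation with a chat LLM.
# Everything below (prompt format, model name, language pair, metric) is an
# assumption for illustration; it is not the configuration used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_synthetic_pair(seed_sentence: str,
                            src_lang: str = "German",
                            tgt_lang: str = "English") -> tuple[str, str]:
    """Produce one synthetic (source, target) pair from a seed sentence.
    The prompt is only one plausible format; the paper compares three prompt
    variants that the abstract does not spell out."""
    prompt = (
        f"Paraphrase the following {src_lang} sentence, then translate the "
        f"paraphrase into {tgt_lang}.\n"
        f"Sentence: {seed_sentence}\n"
        "Answer exactly in the format:\n"
        f"{src_lang}: <paraphrase>\n"
        f"{tgt_lang}: <translation>"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model; the abstract only says "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,         # higher temperature encourages more diverse outputs
    )
    text = resp.choices[0].message.content.strip()
    # Naive parsing: assumes the model followed the requested two-line format.
    src_line, tgt_line = [line.split(":", 1)[1].strip()
                          for line in text.splitlines()[:2]]
    return src_line, tgt_line


def distinct_n(sentences: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams, a common lexical-diversity
    measure (the abstract does not name the two metrics the paper uses)."""
    ngrams = [tuple(tokens[i:i + n])
              for s in sentences
              for tokens in [s.split()]
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

In this sketch, augmentation is simply repeated calls to generate_synthetic_pair over monolingual seed sentences, followed by a distinct-n check on the synthetic side before mixing the generated pairs into the training corpus; because only API calls are involved, no additional model training is required.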
Related papers
- BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation [71.46236155101032]
We propose Base-Refine, a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models.
We show that fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.
arXiv Detail & Related papers (2025-02-03T00:12:40Z) - DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric.
Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets.
We show that our metric is an effective way to rank synthetic images based on their usability.
arXiv Detail & Related papers (2024-12-06T23:36:36Z) - GASE: Generatively Augmented Sentence Encoding [0.0]
We propose an approach to enhance sentence embeddings by applying generative text models for data augmentation at inference time.
Generatively Augmented Sentence Encoding (GASE) uses diverse synthetic variants of input texts generated by paraphrasing, summarising, or extracting keywords.
We find that generative augmentation leads to larger performance improvements for embedding models with lower baseline performance.
arXiv Detail & Related papers (2024-11-07T17:53:47Z) - Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis [21.210982054134686]
Joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field.
Existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material.
arXiv Detail & Related papers (2024-04-30T15:22:19Z) - Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z) - Continual Knowledge Distillation for Neural Machine Translation [74.03622486218597]
Many parallel corpora are not publicly accessible for data copyright, data privacy, and competitive differentiation reasons.
We propose a method called continual knowledge distillation to take advantage of existing translation models to improve one model of interest.
arXiv Detail & Related papers (2022-12-18T14:41:13Z) - Transformers as Neural Augmentors: Class Conditional Sentence Generation
via Variational Bayes [0.0]
We propose a neural data augmentation method that combines a Variational Autoencoder with an encoder-decoder Transformer model.
While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language together with its class condition.
Our model improves the performance of current models over other data augmentation techniques while requiring only a small amount of computational power.
arXiv Detail & Related papers (2022-05-19T08:42:33Z) - Enriching Non-Autoregressive Transformer with Syntactic and
Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Our model achieves significantly faster decoding speed while maintaining translation quality, compared with several state-of-the-art non-autoregressive models.
arXiv Detail & Related papers (2021-01-22T04:12:17Z) - SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)