On Synthetic Data for Back Translation
- URL: http://arxiv.org/abs/2310.13675v1
- Date: Fri, 20 Oct 2023 17:24:12 GMT
- Title: On Synthetic Data for Back Translation
- Authors: Jiahao Xu, Yubin Ruan, Wei Bi, Guoping Huang, Shuming Shi, Lihui Chen,
Lemao Liu
- Abstract summary: Back translation (BT) is one of the most important techniques in NMT research.
We identify two key factors of synthetic data that control back-translation NMT performance: quality and importance.
We propose a simple yet effective method for generating synthetic data that better trades off the two factors and thus improves BT performance.
- Score: 66.6342561585953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Back translation (BT) is one of the most important techniques in NMT research. Existing BT approaches share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model, but little work studies the role the synthetic data plays in BT performance. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance? Through both theoretical and empirical studies, we identify two key factors of synthetic data that control back-translation NMT performance: quality and importance. Based on these findings, we propose a simple yet effective method for generating synthetic data that better trades off the two factors and thus yields better BT performance. We run extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. When our method is used to generate the synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam- and sampling-based data generation), demonstrating the effectiveness of the proposed method.
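The abstract contrasts the two standard ways of producing synthetic data for BT: beam search and random sampling from a backward model. The sketch below illustrates only these two baselines; the Hugging Face MarianMT checkpoint, the translation direction, and the decoding settings are assumptions chosen for illustration, and the paper's proposed quality/importance trade-off method is not reproduced here.

```python
# Illustrative sketch of the two standard BT baselines (beam search vs. sampling)
# for generating synthetic source sentences with a backward model.
# Assumptions (not from the paper): a Hugging Face MarianMT checkpoint stands in
# for the backward model; the forward direction is EN->DE, so the backward model
# translates monolingual German targets into synthetic English sources.
from transformers import MarianMTModel, MarianTokenizer

BACKWARD_CKPT = "Helsinki-NLP/opus-mt-de-en"  # backward model: target (DE) -> source (EN)
tokenizer = MarianTokenizer.from_pretrained(BACKWARD_CKPT)
backward = MarianMTModel.from_pretrained(BACKWARD_CKPT)

def back_translate(target_sentences, mode="beam"):
    """Return synthetic (source, target) pairs for BT training."""
    batch = tokenizer(target_sentences, return_tensors="pt",
                      padding=True, truncation=True)
    if mode == "beam":
        # Beam search: higher-quality but less diverse synthetic sources.
        outputs = backward.generate(**batch, num_beams=5, max_new_tokens=128)
    else:
        # Unrestricted sampling: noisier but more diverse synthetic sources.
        outputs = backward.generate(**batch, do_sample=True, top_k=0,
                                    temperature=1.0, max_new_tokens=128)
    synthetic_sources = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return list(zip(synthetic_sources, target_sentences))

if __name__ == "__main__":
    mono_de = ["Der Hund schläft auf dem Sofa.", "Wir treffen uns morgen."]
    for src, tgt in back_translate(mono_de, mode="sample"):
        print(f"synthetic EN: {src}\treal DE: {tgt}")
```

The resulting synthetic pairs would then be mixed with authentic parallel data to train the forward model; the paper's contribution concerns how this generation step should balance quality against importance.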
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls while outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a self-correction method that enables LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT); see the sketch after this entry.
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
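A minimal sketch of the alternation schedule described in the entry above: the NMT model is updated on synthetic (back-translated) data and authentic parallel data in alternating stages, so the authentic data repeatedly corrects noise introduced by the synthetic corpus. The `train_one_epoch` callback, the corpora, and the number of rounds are hypothetical placeholders, not details taken from that paper.

```python
# Hedged sketch of alternated training for NMT: alternate between a large,
# noisy synthetic (back-translated) corpus and a smaller authentic parallel
# corpus. All names below are illustrative placeholders.
def alternated_training(model, synthetic_corpus, authentic_corpus,
                        train_one_epoch, rounds=4):
    for _ in range(rounds):
        # Stage 1: learn from the large but noisy synthetic corpus.
        train_one_epoch(model, synthetic_corpus)
        # Stage 2: re-train on authentic parallel data to counteract noise
        # picked up in stage 1.
        train_one_epoch(model, authentic_corpus)
    return model
```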