Synthetic Pre-Training Tasks for Neural Machine Translation
- URL: http://arxiv.org/abs/2212.09864v2
- Date: Wed, 31 May 2023 01:34:54 GMT
- Title: Synthetic Pre-Training Tasks for Neural Machine Translation
- Authors: Zexue He, Graeme Blackwood, Rameswar Panda, Julian McAuley, Rogerio Feris
- Abstract summary: Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
- Score: 16.6378815054841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training models with large crawled corpora can lead to issues such as
toxicity and bias, as well as copyright and privacy concerns. A promising way
of alleviating such concerns is to conduct pre-training with synthetic tasks
and data, since no real-world information is ingested by the model. Our goal in
this paper is to understand the factors that contribute to the effectiveness of
pre-training models when using synthetic resources, particularly in the context
of neural machine translation. We propose several novel approaches to
pre-training translation models that involve different levels of lexical and
structural knowledge, including: 1) generating obfuscated data from a large
parallel corpus, 2) concatenating phrase pairs extracted from a small
word-aligned corpus, and 3) generating synthetic parallel data without real
human language corpora. Our experiments on multiple language pairs reveal that
pre-training benefits can be realized even with high levels of obfuscation or
purely synthetic parallel data. We hope the findings from our comprehensive
empirical analysis will shed light on understanding what matters for NMT
pre-training, as well as pave the way for the development of more efficient and
less toxic models.
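The abstract does not spell out the exact data-generation recipes, but the three settings can be pictured with a small sketch. In the hypothetical Python below, obfuscate_pair replaces every word type with an opaque placeholder (hiding real-world lexical content while keeping sentence length, word order, and alignment intact), and make_synthetic_pair fabricates a source/target pair from a purely synthetic vocabulary via a fixed token substitution plus a simple reordering rule; the phrase-pair setting would instead concatenate aligned phrase pairs drawn from a small word-aligned corpus. All names and parameters are illustrative assumptions, not the authors' procedure.

```python
import random

def obfuscate_pair(src_tokens, tgt_tokens, vocab_map):
    """Replace each word type with an opaque placeholder (e.g. 'w_417'),
    consistently across source and target, so lexical content is hidden
    but sentence length, word order, and co-occurrence structure remain."""
    def obf(tokens):
        out = []
        for tok in tokens:
            if tok not in vocab_map:
                vocab_map[tok] = f"w_{len(vocab_map)}"
            out.append(vocab_map[tok])
        return out
    return obf(src_tokens), obf(tgt_tokens)

def make_synthetic_pair(vocab_size=1000, max_len=20, seed=None):
    """Fabricate a parallel pair with no human language at all: a random
    'source' over a synthetic vocabulary, and a 'target' produced by a
    deterministic token substitution plus a local reordering rule."""
    rng = random.Random(seed)
    length = rng.randint(3, max_len)
    src = [f"s_{rng.randrange(vocab_size)}" for _ in range(length)]
    # Deterministic "translation": shift token ids, then swap adjacent tokens.
    tgt = [f"t_{(int(tok.split('_')[1]) + 7) % vocab_size}" for tok in src]
    for i in range(0, len(tgt) - 1, 2):
        tgt[i], tgt[i + 1] = tgt[i + 1], tgt[i]
    return src, tgt

if __name__ == "__main__":
    vocab_map = {}
    print(obfuscate_pair("the cat sat".split(), "le chat est assis".split(), vocab_map))
    print(make_synthetic_pair(seed=0))
```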
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
- Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis [21.210982054134686]
The joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field.
Existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material.
arXiv Detail & Related papers (2024-04-30T15:22:19Z)
- What do Large Language Models Learn beyond Language? [10.9650651784511]
We find that pretrained models significantly outperform comparable non-pretrained neural models.
Experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code.
Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models.
arXiv Detail & Related papers (2022-10-21T23:43:13Z)
- Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity [30.647497555295974]
We train a classifier that identifies machine-written sentences, and observe that the linguistic features of the sentences identified as written by a machine are significantly different from those of human-written sentences.
The distilled information from the classifier is then used to train a reliable sentence embedding model.
Our model trained on synthetic data generalizes well and outperforms the existing baselines.
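One plausible reading of this strategy, sketched below as an assumption rather than the paper's exact method, is that a detector's probability that a sentence is machine-written is turned into a per-example weight, so obviously synthetic-looking sentences contribute less to the embedding-model loss. The function name and weighting scheme are hypothetical.

```python
import torch

def reweighted_loss(per_example_loss, machine_prob, temperature=1.0):
    """Down-weight sentences a detector flags as machine-written
    (hypothetical reading of the reweighting strategy).

    per_example_loss: tensor [batch] of unreduced losses
    machine_prob:     tensor [batch], P(machine-written | sentence) from a
                      separately trained classifier
    """
    # Sentences that look more human-like receive larger weights.
    weights = torch.softmax((1.0 - machine_prob) / temperature, dim=0)
    return (weights * per_example_loss).sum()

# usage (illustrative): losses = criterion(emb_a, emb_b, reduction="none")
#                       loss = reweighted_loss(losses, detector_probs)
```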
arXiv Detail & Related papers (2022-08-29T05:42:22Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo parallel data with a translated source, yet it translates natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario.
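A minimal sketch of such an online self-training step, assuming a generic bilingual model interface (the translate and loss methods here are hypothetical), pairs the usual back-translation direction with the inference-mimicking direction:

```python
def online_self_training_step(model, natural_src_batch, natural_tgt_batch):
    """One hypothetical UNMT update mixing both pseudo-parallel directions."""
    # Back-translation as usual: translate natural target into the source
    # language, then train on (translated source -> natural target).
    translated_src = model.translate(natural_tgt_batch, direction="tgt2src")
    loss_bt = model.loss(src=translated_src, tgt=natural_tgt_batch)

    # Online self-training: translate natural source into the target
    # language, then train on (natural source -> translated target),
    # which matches the input distribution seen at inference time.
    translated_tgt = model.translate(natural_src_batch, direction="src2tgt")
    loss_st = model.loss(src=natural_src_batch, tgt=translated_tgt)

    (loss_bt + loss_st).backward()
```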
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
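Taken literally, the alternation could look like the hypothetical schedule below, where epochs on synthetic (e.g. back-translated) pairs are interleaved with epochs on authentic parallel pairs so the clean data repeatedly re-anchors the model; the schedule, interface, and names are illustrative assumptions.

```python
def alternated_training(model, synthetic_data, authentic_data, rounds=4):
    """Hypothetical alternation schedule: synthetic epoch, then authentic epoch.

    The authentic data acts as guidance, periodically correcting drift
    introduced by noisy synthetic (e.g. back-translated) sentence pairs.
    """
    for _ in range(rounds):
        model.train_epoch(synthetic_data)   # learn from abundant synthetic pairs
        model.train_epoch(authentic_data)   # re-anchor on clean authentic pairs
    return model
```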
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
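One way to read the uncertainty criterion, as a sketch rather than the paper's exact formula, is the average entropy of each word's translation distribution under a probabilistic bilingual dictionary extracted from the parallel data; the monolingual sentences with the highest average entropy are then sampled as the most informative.

```python
import math

def sentence_uncertainty(tokens, word_translation_probs):
    """Average entropy of each word's translation distribution
    (hypothetical reading of monolingual data uncertainty).

    word_translation_probs: dict mapping a source word to a dict of
    target-word probabilities, e.g. {"bank": {"Bank": 0.6, "Ufer": 0.4}}.
    """
    entropies = []
    for tok in tokens:
        dist = word_translation_probs.get(tok)
        if not dist:
            continue  # words missing from the dictionary contribute nothing
        entropies.append(-sum(p * math.log(p) for p in dist.values() if p > 0))
    return sum(entropies) / len(entropies) if entropies else 0.0

def select_most_informative(sentences, word_translation_probs, k):
    """Pick the k monolingual sentences with the highest average uncertainty."""
    scored = sorted(
        sentences,
        key=lambda s: sentence_uncertainty(s.split(), word_translation_probs),
        reverse=True,
    )
    return scored[:k]
```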
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
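A hedged illustration of such a distillation term, assuming the syntactic teacher exposes approximate marginal word probabilities per masked position, is a KL divergence against the student's predicted distribution:

```python
import torch.nn.functional as F

def syntactic_distillation_loss(student_logits, teacher_marginals):
    """Hypothetical distillation term over the vocabulary at each masked position.

    student_logits:    [positions, vocab] raw scores from the BERT student
    teacher_marginals: [positions, vocab] approximate marginal word
                       probabilities from the syntactic language model
    """
    log_student = F.log_softmax(student_logits, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target.
    return F.kl_div(log_student, teacher_marginals, reduction="batchmean")
```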
arXiv Detail & Related papers (2020-05-27T16:44:01Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.