Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
- URL: http://arxiv.org/abs/2510.08245v1
- Date: Thu, 09 Oct 2025 14:04:52 GMT
- Title: Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling
- Authors: Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson
- Abstract summary: We investigate the benefits of contrastive decoding for generating synthetic corpora. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks.
- Score: 9.380879437204277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and a bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface-level linguistic capabilities.
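The abstract describes sampling from "the relative difference between a good and bad model". As a rough illustration only, the sketch below scores each candidate token by the good model's log-probability minus a weighted bad-model log-probability, then samples from the result. The function names, the `alpha` plausibility cutoff, the `lam` weight, and the toy logits are all assumptions made for this example and are not taken from the paper, whose exact formulation is not given here.

```python
import math
import random


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(good_logits, bad_logits, alpha=0.1, lam=1.0):
    """Score tokens by good-model log-prob minus lam * bad-model log-prob.

    Assumed plausibility cutoff: tokens whose good-model probability is below
    alpha times the good model's top probability are masked with -inf.
    """
    good_lp = log_softmax(good_logits)
    bad_lp = log_softmax(bad_logits)
    cutoff = max(good_lp) + math.log(alpha)
    return [
        g - lam * b if g >= cutoff else float("-inf")
        for g, b in zip(good_lp, bad_lp)
    ]


def sample_token(scores, rng):
    """Sample a token index with probability proportional to exp(score)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


# Toy next-token logits over a three-token vocabulary (hypothetical values).
good = [2.0, 1.5, -3.0]
bad = [2.0, 0.0, -3.0]

scores = contrastive_scores(good, bad)
token = sample_token(scores, random.Random(0))
```

In this toy case the good model alone prefers token 0, but token 1 receives the highest contrastive score because the bad model assigns it much less probability, and token 2 is masked out by the cutoff. Generating a corpus would repeat this step token by token with two real language models in place of the fixed logit lists.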
Related papers
- Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls [25.294408301653576]
Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. We compare natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. We find pre-training on rephrased synthetic data alone is not faster than pre-training on natural web texts.
arXiv Detail & Related papers (2025-10-02T03:24:42Z)
- Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
- Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- Few-shot LLM Synthetic Data with Distribution Matching [37.55363714371521]
Large language models (LLMs) produce high-quality synthetic data to enhance the performance of smaller models. LLM-generated synthetic data often differs from the real data in key language attributes. We introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching.
arXiv Detail & Related papers (2025-02-09T16:43:32Z)
- SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data [78.70620682374624]
We introduce SynFER, a novel framework for synthesizing facial expression image data based on high-level textual descriptions. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique and a pseudo-label generator. Results validate the efficacy of our approach and the synthetic data.
arXiv Detail & Related papers (2024-10-13T14:58:21Z)
- Data Generation Using Large Language Models for Text Classification: An Empirical Case Study [15.447491854250227]
We use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches.
This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
arXiv Detail & Related papers (2024-06-27T21:41:43Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations [21.583825474908334]
We study how the performance of models trained on synthetic data may vary with the subjectivity of classification.
Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data.
arXiv Detail & Related papers (2023-10-11T19:51:13Z)
- Synthetic Pre-Training Tasks for Neural Machine Translation [16.6378815054841]
Our goal is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources.
We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge.
Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data.
arXiv Detail & Related papers (2022-12-19T21:34:00Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.