Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
- URL: http://arxiv.org/abs/2406.03628v1
- Date: Wed, 5 Jun 2024 21:24:26 GMT
- Title: Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
- Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang,
- Abstract summary: Imbalanced data and spurious correlations are common challenges in machine learning and data science.
Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges.
We introduce OPAL, a systematic oversampling approach that leverages the capabilities of large language models to generate high-quality synthetic data for minority groups.
- Score: 16.047084318753377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (\textbf{O}versam\textbf{P}ling with \textbf{A}rtificial \textbf{L}LM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data, and shows the capacity of transformers in generating high-quality synthetic data for both labels and covariates. We further conduct intensive numerical experiments to demonstrate the efficacy of our proposed approach compared to some representative alternative solutions.
Related papers
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data.
Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility.
We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Does Synthetic Data Make Large Language Models More Efficient? [0.0]
This paper explores the nuances of synthetic data generation in NLP.
We highlight its advantages, including data augmentation potential and the introduction of structured variety.
We demonstrate the impact of template-based synthetic data on the performance of modern transformer models.
arXiv Detail & Related papers (2023-10-11T19:16:09Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Towards Synthetic Multivariate Time Series Generation for Flare
Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z) - Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to synthetic-learning founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.