Bias-Corrected Data Synthesis for Imbalanced Learning
- URL: http://arxiv.org/abs/2510.26046v1
- Date: Thu, 30 Oct 2025 00:52:25 GMT
- Title: Bias-Corrected Data Synthesis for Imbalanced Learning
- Authors: Pengfei Lyu, Zhengchi Ma, Linjun Zhang, Anru R. Zhang,
- Abstract summary: Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates.<n>A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data.<n>In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group.
- Score: 18.33651035966011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
Related papers
- Beyond Real Data: Synthetic Data through the Lens of Regularization [9.459299281438074]
Synthetic data can improve generalization when real data is scarce, but excessive reliance may introduce distributional mismatches that degrade performance.<n>We present a learning-theoretic framework to quantify the trade-off between synthetic and real data.
arXiv Detail & Related papers (2025-10-09T11:33:09Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments.<n>We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling [0.0]
We show that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space.<n>We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data.
arXiv Detail & Related papers (2025-07-22T10:11:32Z) - Active Data Sampling and Generation for Bias Remediation [0.0]
A mixed active sampling and data generation strategy -- called samplation -- is proposed to compensate during fine-tuning of a pre-trained classifer the unfair classifications it produces.<n>Using as case study Deep Models for visual semantic role labeling, the proposed method has been able to fully cure a simulated gender bias starting from a 90/10 imbalance.
arXiv Detail & Related papers (2025-03-26T10:42:15Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z) - SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems [39.675787338941184]
This paper explores the potential of synthetic data to address the data imbalance problem.
To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data.
Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples allows to achieve impressive performance.
arXiv Detail & Related papers (2023-08-02T07:59:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Investigating Bias with a Synthetic Data Generator: Empirical Evidence
and Philosophical Interpretation [66.64736150040093]
Machine learning applications are becoming increasingly pervasive in our society.
Risk is that they will systematically spread the bias embedded in data.
We propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations.
arXiv Detail & Related papers (2022-09-13T11:18:50Z) - Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.