Related papers: Evaluating the Impact of Data Augmentation on Predictive Model Performance

URL: http://arxiv.org/abs/2412.02108v1
Date: Tue, 03 Dec 2024 03:03:04 GMT
Title: Evaluating the Impact of Data Augmentation on Predictive Model Performance
Authors: Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada,
Abstract summary: This paper systematically compares data augmentation techniques and their impact on prediction performance. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01. Some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance.
Score: 0.05624791703748109
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper's contribution is twofold. Primarily, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.
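
The abstract names SMOTE-ENN as the best-performing technique but does not show what it does. The following is a minimal pure-Python sketch of the general idea, not the authors' implementation: SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, and ENN (edited nearest neighbours) then drops samples whose label disagrees with their neighbours. The function names, the toy data, the value k=3, and the majority-vote removal rule are all assumptions made for illustration; library implementations may use different defaults.

```python
import math
import random
from collections import Counter


def smote(samples, minority_label, n_new, k=3, rng=None):
    """Append n_new synthetic minority samples built by interpolation.

    samples is a list of (features, label) pairs, features being a tuple.
    Needs at least two minority samples.
    """
    rng = rng or random.Random(0)
    minority = [x for x, y in samples if y == minority_label]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen base point.
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        point = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        synthetic.append((point, minority_label))
    return samples + synthetic


def edited_nearest_neighbours(samples, k=3):
    """Keep only samples whose label matches the majority of their k neighbours.

    Assumption: a simple majority vote is used here as the removal rule.
    """
    kept = []
    for i, (x, y) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        neighbours = sorted(others, key=lambda s: math.dist(x, s[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        if votes.most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept


# Toy imbalanced dataset (hypothetical): 40 majority vs. 8 minority points.
demo_rng = random.Random(42)
toy = [((demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)), 0) for _ in range(40)]
toy += [((demo_rng.gauss(3, 1), demo_rng.gauss(3, 1)), 1) for _ in range(8)]

oversampled = smote(toy, minority_label=1, n_new=32, k=3, rng=demo_rng)
cleaned = edited_nearest_neighbours(oversampled, k=3)
print("class counts after SMOTE:", Counter(y for _, y in oversampled))
print("class counts after ENN:  ", Counter(y for _, y in cleaned))
```

Because the ENN step removes points near the class boundary, the cleaned training set can end up smaller than the oversampled one. This may be related to the shorter training time the abstract reports, though the sketch does not test that.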

Related papers

The Impact of Feature Scaling In Machine Learning: Effects on Regression and Classification Tasks [0.8388858079753069]
This research addresses the critical lack of comprehensive studies on feature scaling by systematically evaluating 12 scaling techniques across 14 different Machine Learning algorithms and 16 datasets for classification and regression tasks. We meticulously analyzed impacts on predictive performance (using metrics such as accuracy, MAE, MSE, and $R^2$) and computational costs (training time, inference time, and memory usage).
arXiv Detail & Related papers (2025-06-09T22:32:51Z)
Exploring Training and Inference Scaling Laws in Generative Retrieval [50.82554729023865]
We investigate how model size, training data scale, and inference-time compute jointly influence generative retrieval performance. Our experiments show that n-gram-based methods demonstrate strong alignment with both training and inference scaling laws. We find that LLaMA models consistently outperform T5 models, suggesting a particular advantage for larger decoder-only models in generative retrieval.
arXiv Detail & Related papers (2025-03-24T17:59:03Z)
Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models. Our study shows positive correlations between test set accuracy and data diversity. These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance.
arXiv Detail & Related papers (2025-01-15T00:56:59Z)
On the Diversity of Synthetic Data and its Impact on Training Large Language Models [34.00031258223175]
Large Language Models (LLMs) have accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-10-19T22:14:07Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
On Improving the Algorithm-, Model-, and Data- Efficiency of Self-Supervised Learning [18.318758111829386]
We propose an efficient single-branch SSL method based on non-parametric instance discrimination. We also propose a novel self-distillation loss that minimizes the KL divergence between the probability distribution and its square root version.
arXiv Detail & Related papers (2024-04-30T06:39:04Z)
Simulation-Enhanced Data Augmentation for Machine Learning Pathloss Prediction [9.664420734674088]
This paper introduces a novel simulation-enhanced data augmentation method for machine learning pathloss prediction. Our method integrates synthetic data generated from a cellular coverage simulator and independently collected real-world datasets. The integration of synthetic data significantly improves the generalizability of the model in different environments.
arXiv Detail & Related papers (2024-02-03T00:38:08Z)
Which Augmentation Should I Use? An Empirical Investigation of Augmentations for Self-Supervised Phonocardiogram Representation Learning [5.438725298163702]
Contrastive Self-Supervised Learning (SSL) has shown promise in mitigating the issue of data scarcity. Our research aims to explore and evaluate a wide range of audio-based augmentations and uncover combinations that enhance SSL model performance in PCG classification.
arXiv Detail & Related papers (2023-12-01T11:06:00Z)
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models [75.29595679428105]
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that rejection samples from multiple models push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
arXiv Detail & Related papers (2023-08-03T15:34:01Z)
Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process. We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning [57.83232242068982]
Data augmentation (DA) is a crucial technique for enhancing the sample efficiency of visual reinforcement learning (RL) algorithms. It remains unclear which attributes of DA account for its effectiveness in achieving sample-efficient visual RL. This work conducts comprehensive experiments to assess the impact of DA's attributes on its efficacy.
arXiv Detail & Related papers (2023-05-25T15:46:20Z)
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting. We also examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUGC is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. G-DAUGC consistently outperforms existing data augmentation methods based on back-translation. Our analysis demonstrates that G-DAUGC produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.