Related papers: Towards Understanding How Data Augmentation Works with Imbalanced Data

URL: http://arxiv.org/abs/2304.05895v1
Date: Wed, 12 Apr 2023 15:01:22 GMT
Title: Towards Understanding How Data Augmentation Works with Imbalanced Data
Authors: Damien A. Dablain and Nitesh V. Chawla
Abstract summary: We study the effect of data augmentation (DA) on three different classifiers: convolutional neural networks, support vector machines, and logistic regression models. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection. We hypothesize that DA works by facilitating variances in data, so that machine learning models can associate changes in the data with labels.
Score: 17.478900028887537
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data augmentation forms the cornerstone of many modern machine learning training pipelines, yet the mechanisms by which it works are not clearly understood. Much of the research on data augmentation (DA) has focused on improving existing techniques, examining its regularization effects in the context of neural network over-fitting, or investigating its impact on features. Here, we undertake a holistic examination of the effect of DA on three different classifiers (convolutional neural networks, support vector machines, and logistic regression models), which are commonly used in supervised classification of imbalanced data. We support our examination with testing on three image and five tabular datasets. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection, even though it may only yield relatively modest changes to global metrics, such as balanced accuracy or F1 measure. We hypothesize that DA works by facilitating variances in data, so that machine learning models can associate changes in the data with labels. By diversifying the range of feature amplitudes that a model must recognize to predict a label, DA improves a model's capacity to generalize when learning with imbalanced data.
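
Illustrative sketch (not from the paper): the abstract's claim that DA can shift model weights noticeably while global metrics move only modestly can be probed on a toy problem. The snippet below assumes a synthetic two-feature Gaussian dataset, a hand-written logistic regression, and Gaussian-jitter oversampling of the minority class as a stand-in for DA. None of these choices reflect the authors' datasets, models, or augmentation methods, and every function name is hypothetical.

```python
import math
import random


def make_imbalanced(n_major=200, n_minor=20, seed=0):
    """Toy 2-feature data: majority class 0 near (0, 0), minority class 1 near (1.5, 1.5)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_major):
        X.append([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(0)
    for _ in range(n_minor):
        X.append([rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)])
        y.append(1)
    return X, y


def augment_minority(X, y, noise=0.3, seed=1):
    """Assumed stand-in for DA: add jittered copies of minority rows until classes are balanced."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    n_needed = y.count(0) - y.count(1)
    X_aug, y_aug = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority)
        X_aug.append([v + rng.gauss(0.0, noise) for v in base])
        y_aug.append(1)
    return X_aug, y_aug


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=300):
    """Batch gradient descent on the logistic loss; returns [w1, w2, bias]."""
    w = [0.0, 0.0, 0.0]
    n = len(X)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, label in zip(X, y):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) - label
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def balanced_accuracy(w, X, y):
    """Mean of per-class recall for the linear decision rule w.x + b > 0."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for x, label in zip(X, y):
        pred = 1 if w[0] * x[0] + w[1] * x[1] + w[2] > 0 else 0
        total[label] += 1
        correct[label] += int(pred == label)
    return 0.5 * (correct[0] / total[0] + correct[1] / total[1])


X, y = make_imbalanced()
X_aug, y_aug = augment_minority(X, y)
w_base = fit_logreg(X, y)
w_aug = fit_logreg(X_aug, y_aug)

# Evaluate both models on a fresh imbalanced sample drawn with a different seed.
X_test, y_test = make_imbalanced(seed=2)
bacc_base = balanced_accuracy(w_base, X_test, y_test)
bacc_aug = balanced_accuracy(w_aug, X_test, y_test)

print("weights without augmentation:", [round(v, 3) for v in w_base])
print("weights with augmentation:   ", [round(v, 3) for v in w_aug])
print("balanced accuracy:", round(bacc_base, 3), "vs", round(bacc_aug, 3))
```

Comparing the two printed weight vectors alongside the two balanced-accuracy values gives a rough feel for the kind of comparison the abstract describes. A toy run like this says nothing about the paper's actual findings on image or tabular data.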

Related papers

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
How Does Data Diversity Shape the Weight Landscape of Neural Networks? [2.89287673224661]
We investigate the impact of dropout, weight decay, and noise augmentation on the parameter space of neural networks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. We conclude that synthetic data can bring more diversity into real input data, resulting in a better performance on out-of-distribution test instances.
arXiv Detail & Related papers (2024-10-18T16:57:05Z)
Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT). CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction. We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation. We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare. Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
A Guide for Practical Use of ADMG Causal Data Augmentation [0.0]
Causal data augmentation strategies have been pointed out as a solution to handle these challenges. This paper experimentally analyzed the ADMG causal augmentation method considering different settings.
arXiv Detail & Related papers (2023-04-03T09:31:13Z)
Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks. Data augmentation induces these symmetries during training by applying multiple transformations to the input data. This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
Vector-Based Data Improves Left-Right Eye-Tracking Classifier Performance After a Covariate Distributional Shift [0.0]
We propose a fine-grain data approach for EEG-ET data collection in order to create more robust benchmarking. We train machine learning models utilizing both coarse-grain and fine-grain data and compare their accuracies when tested on data of similar/different distributional patterns. Results showed that models trained on fine-grain, vector-based data were less susceptible to distributional shifts than models trained on coarse-grain, binary-classified data.
arXiv Detail & Related papers (2022-07-31T16:27:50Z)
Using Explainable Boosting Machine to Compare Idiographic and Nomothetic Approaches for Ecological Momentary Assessment Data [2.0824228840987447]
This paper explores the use of non-linear interpretable machine learning (ML) models in classification problems. Various ensembles of trees are compared to linear models using imbalanced synthetic and real-world datasets. In one of the two real-world datasets, the knowledge distillation method achieves improved AUC scores.
arXiv Detail & Related papers (2022-04-04T17:56:37Z)
Analyzing the Effects of Handling Data Imbalance on Learned Features from Medical Images by Looking Into the Models [50.537859423741644]
Training a model on an imbalanced dataset can introduce unique challenges to the learning problem. We look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features.
arXiv Detail & Related papers (2022-04-04T09:38:38Z)
Effect of Balancing Data Using Synthetic Data on the Performance of Machine Learning Classifiers for Intrusion Detection in Computer Networks [3.233545237942899]
Researchers in academia and industry have used machine learning (ML) techniques to design and implement Intrusion Detection Systems (IDSes) for computer networks. In many of the datasets used in such systems, data are imbalanced (i.e., not all classes have an equal number of samples). We show that training ML models on a dataset balanced with synthetic samples generated by CTGAN increased prediction accuracy by up to 8%.
arXiv Detail & Related papers (2022-04-01T00:25:11Z)
CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance. Sample re-weighting methods are popularly used to alleviate this data bias issue. We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach [80.8446673089281]
We propose a new learning paradigm with graph representation and learning. Our framework contains two modules: 1) a backbone network (e.g., feedforward neural nets) as a lower model takes features as input and outputs predicted labels; 2) a graph neural network as an upper model learns to extrapolate embeddings for new features via message passing over a feature-data graph built from observed data.
arXiv Detail & Related papers (2021-10-09T09:02:45Z)
