Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation
- URL: http://arxiv.org/abs/2409.10286v2
- Date: Tue, 1 Oct 2024 11:08:24 GMT
- Authors: Neil De La Fuente, Mireia Majó, Irina Luzko, Henry Córdova, Gloria Fernández-Esparrach, Jorge Bernal
- Abstract summary: This paper introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities.
By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance.
The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Accurate and robust medical image classification is a challenging task, especially in application domains where available annotated datasets are small and present high imbalance between target classes. Considering that data acquisition is not always feasible, especially for underrepresented classes, our approach introduces a novel synthetic augmentation strategy using class-specific Variational Autoencoders (VAEs) and latent space interpolation to improve discrimination capabilities. By generating realistic, varied synthetic data that fills feature space gaps, we address issues of data scarcity and class imbalance. The method presented in this paper relies on the interpolation of latent representations within each class, thus enriching the training set and improving the model's generalizability and diagnostic accuracy. The proposed strategy was tested in a small dataset of 321 images created to train and validate an automatic method for assessing the quality of cleanliness of esophagogastroduodenoscopy images. By combining real and synthetic data, an increase of over 18% in the accuracy of the most challenging underrepresented class was observed. The proposed strategy not only benefited the underrepresented class but also led to a general improvement in other metrics, including a 6% increase in global accuracy and precision.
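The core augmentation step described above, convex interpolation between latent codes of the same class, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `encode` and `decode` are hypothetical placeholders standing in for a trained class-specific VAE's encoder and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_latents(latents, n_new, rng):
    """Generate n_new synthetic latent codes by convex interpolation
    between randomly chosen pairs of same-class latent vectors."""
    n = len(latents)
    i = rng.integers(0, n, size=n_new)
    j = rng.integers(0, n, size=n_new)
    alpha = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return alpha * latents[i] + (1.0 - alpha) * latents[j]

# Toy stand-ins for a trained class-specific VAE (hypothetical):
def encode(x):          # (n, d_img) -> (n, d_latent)
    return x[:, :4]     # placeholder projection
def decode(z):          # (n, d_latent) -> (n, d_img)
    return np.tile(z, 2)  # placeholder reconstruction

# Augment an underrepresented class: 10 real images -> 50 synthetic ones.
real = rng.normal(size=(10, 8))
z = encode(real)
z_new = interpolate_latents(z, 50, rng)
synthetic = decode(z_new)
print(synthetic.shape)  # (50, 8)
```

Because each synthetic latent is a convex combination of two real same-class latents, the generated samples stay inside the convex hull of the class's latent distribution, filling feature-space gaps rather than extrapolating beyond them.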
Related papers
- TSynD: Targeted Synthetic Data Generation for Enhanced Medical Image Classification [0.011037620731410175]
This work aims to guide the generative model to synthesize data with high uncertainty.
We alter the feature space of the autoencoder through an optimization process.
We improve the robustness against test-time data augmentations and adversarial attacks on several classification tasks.
arXiv Detail & Related papers (2024-06-25T11:38:46Z)
- Provable Optimization for Adversarial Fair Self-supervised Contrastive Learning [49.417414031031264]
This paper studies learning fair encoders in a self-supervised learning setting.
All data are unlabeled and only a small portion of them are annotated with sensitive attributes.
arXiv Detail & Related papers (2024-06-09T08:11:12Z)
- DALSA: Domain Adaptation for Supervised Learning From Sparsely Annotated MR Images [2.352695945685781]
We propose a new method that employs transfer learning techniques to correct sampling selection errors introduced by sparse annotations during supervised learning for automated tumor segmentation.
The proposed method derives high-quality classifiers for the different tissue classes from sparse and unambiguous annotations.
Compared to training on fully labeled data, we reduced the time for labeling and training by factors greater than 70 and 180, respectively, without sacrificing accuracy.
arXiv Detail & Related papers (2024-03-12T09:17:21Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z)
- SSL-CPCD: Self-supervised learning with composite pretext-class discrimination for improved generalisability in endoscopic image analysis [3.1542695050861544]
Deep learning-based supervised methods are widely popular in medical image analysis.
They require a large amount of training data and face issues in generalisability to unseen datasets.
We propose to explore patch-level instance-group discrimination and penalisation of inter-class variation using additive angular margin.
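The additive angular margin mentioned above is commonly implemented ArcFace-style: the true-class cosine logit cos(θ) is replaced by cos(θ + m) before scaling, forcing larger angular separation between classes. A hedged NumPy sketch, with illustrative names and default values (not necessarily those used in SSL-CPCD):

```python
import numpy as np

def margin_logits(cos_theta, labels, margin=0.5, scale=30.0):
    """ArcFace-style additive angular margin: for each sample, replace
    the true-class cosine similarity cos(theta) with cos(theta + margin),
    then scale all logits. Lowering the target logit penalises
    inter-class variation during training."""
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    out = cos_theta.copy()
    rows = np.arange(len(labels))
    out[rows, labels] = np.cos(theta[rows, labels] + margin)
    return scale * out

# Two samples, two classes: rows are cosine similarities to class centres.
cos_theta = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
logits = margin_logits(cos_theta, labels)
```

Only the target-class entry of each row is penalised; the other logits are merely scaled, so the margin makes the classification problem strictly harder at training time.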
arXiv Detail & Related papers (2023-05-31T21:28:08Z)
- Classification of datasets with imputed missing values: does imputation quality matter? [2.7646249774183]
Classifying samples in incomplete datasets is non-trivial.
We demonstrate how the commonly used measures for assessing quality are flawed.
We propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data.
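A distribution-level discrepancy of the kind described can be illustrated with a per-feature 1-D Wasserstein-style distance between the original and imputed data. This is an assumed stand-in for the idea, not the paper's proposed score:

```python
import numpy as np

def distribution_discrepancy(original, imputed):
    """Per-feature 1-D Wasserstein-style distance between two datasets
    of equal size: sort each column and average absolute differences.
    Measures how well the overall distribution is recreated, rather
    than per-entry reconstruction error."""
    a = np.sort(original, axis=0)
    b = np.sort(imputed, axis=0)
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
good = data + rng.normal(scale=0.05, size=data.shape)  # preserves the distribution
bad = np.full_like(data, data.mean())                  # mean imputation collapses it
print(distribution_discrepancy(data, good) < distribution_discrepancy(data, bad))
```

Note that mean imputation can score well on pointwise error (RMSE) while destroying the distribution entirely, which is exactly the flaw in commonly used quality measures that distribution-level scores are meant to expose.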
arXiv Detail & Related papers (2022-06-16T22:58:03Z)
- Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
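The feature-alignment idea can be illustrated at a single scale by matching batch mean features and updating the synthetic data to reduce the mismatch. This toy sketch omits the multi-scale, in-network alignment that CAFE actually performs:

```python
import numpy as np

def alignment_loss(real_feats, syn_feats):
    """Squared distance between the mean feature vectors of real and
    synthetic batches (single-scale simplification of feature alignment)."""
    diff = real_feats.mean(axis=0) - syn_feats.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(64, 16))  # large real batch
syn = rng.normal(loc=0.0, size=(8, 16))    # small synthetic (condensed) set

# One gradient step on the synthetic features: the gradient of the loss
# w.r.t. each synthetic row is -(2 / n_syn) * (real_mean - syn_mean).
before = alignment_loss(real, syn)
grad = -(2.0 / len(syn)) * (real.mean(axis=0) - syn.mean(axis=0))
syn = syn - 0.5 * grad
after = alignment_loss(real, syn)
```

Iterating this update pulls the synthetic set's feature statistics toward those of the real data, which is the intuition behind condensing a dataset by aligning features.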
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
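The simplest of these strategies, random oversampling of the minority classes, can be sketched as follows (a minimal illustration of the general technique, not a method from the paper):

```python
import numpy as np
from collections import Counter

def random_oversample(X, y, rng):
    """Balance classes by resampling each minority class with
    replacement until every class matches the majority class count."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - len(idx), replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.array([0] * 25 + [1] * 5)  # imbalanced: 25 vs 5
X_bal, y_bal = random_oversample(X, y, rng)
print(Counter(y_bal))  # both classes now have 25 samples
```

Undersampling works symmetrically by discarding majority-class examples; which strategy performs better depends on dataset properties such as size and imbalance ratio, which is exactly the question this related paper studies.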
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer [72.5190560787569]
In computer vision, learning from long tailed datasets is a recurring theme, especially for natural image datasets.
Our proposal posits a meta-distributional scenario, where the data generating mechanism is invariant across the label-conditional feature distributions.
This allows us to leverage a causal data inflation procedure to enlarge the representation of minority classes.
arXiv Detail & Related papers (2020-11-25T00:13:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.