A Survey of Methods for Handling Disk Data Imbalance
- URL: http://arxiv.org/abs/2310.08867v1
- Date: Fri, 13 Oct 2023 05:35:13 GMT
- Title: A Survey of Methods for Handling Disk Data Imbalance
- Authors: Shuangshuang Yuan, Peng Wu, Yuehui Chen and Qiang Li
- Abstract summary: This paper provides a comprehensive overview of research in the field of imbalanced data classification.
The Backblaze dataset, a widely used dataset related to hard discs, has a small amount of failure data and a large amount of health data, which exhibits a serious class imbalance.
- Score: 10.261915886145214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Class imbalance exists in many classification problems, and since the data is
designed for accuracy, imbalance in data classes can lead to classification
challenges with a few classes having higher misclassification costs. The
Backblaze dataset, a widely used dataset related to hard discs, has a small
amount of failure data and a large amount of health data, which exhibits a
serious class imbalance. This paper provides a comprehensive overview of
research in the field of imbalanced data classification. The discussion is
organized into three main aspects: data-level methods, algorithmic-level
methods, and hybrid methods. For each type of method, we summarize and analyze
the existing problems, algorithmic ideas, strengths, and weaknesses.
Additionally, the challenges of unbalanced data classification are discussed,
along with strategies to address them. It is convenient for researchers to
choose the appropriate method according to their needs.
Related papers
- A Survey of Deep Long-Tail Classification Advancements [1.6233132273470656]
Many data distributions in the real world are hardly uniform. Instead, skewed and long-tailed distributions of various kinds are commonly observed.
This poses an interesting problem for machine learning, where most algorithms assume or work well with uniformly distributed data.
The problem is further exacerbated by current state-of-the-art deep learning models requiring large volumes of training data.
arXiv Detail & Related papers (2024-04-24T01:59:02Z) - Data-level hybrid strategy selection for disk fault prediction model
based on multivariate GAN [7.270429986841776]
Data class imbalance is a common problem in classification problems, where minority class samples are often more important and more costly to misclassify.
The SMART dataset exhibits an evident class imbalance, comprising a substantial quantity of healthy samples and a comparatively limited number of defective samples.
This dataset serves as a reliable indicator of the disc's health status.
arXiv Detail & Related papers (2023-10-10T11:34:53Z) - Neural Collapse Terminus: A Unified Solution for Class Incremental
Learning and Its Variants [166.916517335816]
In this paper, we offer a unified solution to the misalignment dilemma in the three tasks.
We propose neural collapse terminus that is a fixed structure with the maximal equiangular inter-class separation for the whole label space.
Our method holds the neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity.
arXiv Detail & Related papers (2023-08-03T13:09:59Z) - Revisiting Long-tailed Image Classification: Survey and Benchmarks with
New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distribution.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z) - Balanced Split: A new train-test data splitting strategy for imbalanced
datasets [0.0]
Class imbalance is a problem since most machine learning algorithms are built with an assumption of equal representation of all classes in the training dataset.
This paper shows a new way to counter the class imbalance problem through a new data-splitting strategy called balanced split.
arXiv Detail & Related papers (2022-12-17T10:36:39Z) - A Survey of Methods for Addressing Class Imbalance in Deep-Learning
Based Natural Language Processing [68.37496795076203]
We provide guidance for NLP researchers and practitioners dealing with imbalanced data.
We first discuss various types of controlled and real-world class imbalance.
We organize the methods by whether they are based on sampling, data augmentation, choice of loss function, staged learning, or model design.
arXiv Detail & Related papers (2022-10-10T13:26:40Z) - A Hybrid Approach for Binary Classification of Imbalanced Data [0.0]
We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimentionality reduction, and ensemble learning.
We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC.
arXiv Detail & Related papers (2022-07-06T15:18:41Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep
Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z) - Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big
Data [11.117880929232575]
Split data strategies and lack of data in minority class due to the use of MapReduce paradigm have posed new challenges for tackling imbalanced data problems.
Smart Data refers to data of enough quality to achieve high performance models.
We propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains.
arXiv Detail & Related papers (2020-01-16T12:25:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.