A Comprehensive Survey on Imbalanced Data Learning
- URL: http://arxiv.org/abs/2502.08960v1
- Date: Thu, 13 Feb 2025 04:53:17 GMT
- Title: A Comprehensive Survey on Imbalanced Data Learning
- Authors: Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Conghui He, Hongzhi Yin, Wentao Zhang
- Abstract summary: Imbalanced data is prevalent in various types of raw data and hinders the performance of machine learning.
This survey systematically analyzes various real-world data formats.
It groups existing research for different data formats into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
- Score: 45.3186824501823
- License:
- Abstract: With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in various types of raw data and severely hinder the performance of ML by biasing the decision-making processes. To deepen the understanding of imbalanced data and facilitate related research and applications, this survey systematically analyzes various real-world data formats and groups existing research for different data formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers comprehensively understand the pervasive nature of imbalance across diverse data formats, thereby paving a clearer path toward achieving specific research goals. We provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advancements in this critical area of study.
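As an illustration of the first category named in the abstract, data re-balancing, here is a minimal sketch of random oversampling. It is not taken from the survey: the function name `random_oversample`, the toy data, and the choice of oversampling over other re-balancing methods are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the original data and only append duplicates.
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Toy data: four samples of class 0 and one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have the same count
```

Oversampling by duplication keeps every original sample but can encourage overfitting to the repeated minority points, which is one reason the other three categories exist as alternatives.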
Related papers
- A Survey on Group Fairness in Federated Learning: Challenges, Taxonomy of Solutions and Directions for Future Research [5.08731160761218]
Group fairness in machine learning is a critical area of research focused on achieving equitable outcomes across different groups.
Federated learning amplifies the need for fairness due to the heterogeneous data distributions across clients.
No dedicated survey has focused comprehensively on group fairness in federated learning.
We create a novel taxonomy of these approaches based on key criteria such as data partitioning, location, and applied strategies.
arXiv Detail & Related papers (2024-10-04T18:39:28Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a single specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- A Survey of Deep Long-Tail Classification Advancements [1.6233132273470656]
Many data distributions in the real world are hardly uniform. Instead, skewed and long-tailed distributions of various kinds are commonly observed.
This poses an interesting problem for machine learning, where most algorithms assume or work well with uniformly distributed data.
The problem is further exacerbated by current state-of-the-art deep learning models requiring large volumes of training data.
arXiv Detail & Related papers (2024-04-24T01:59:02Z)
- Benchmarking Data Science Agents [11.582116078653968]
Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing.
Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical processes.
We introduce DSEval -- a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents.
arXiv Detail & Related papers (2024-02-27T03:03:06Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Supervised Algorithmic Fairness in Distribution Shifts: A Survey [17.826312801085052]
In real-world applications, machine learning models are often trained on a specific dataset but deployed in environments where the data distribution may shift.
This shift can lead to unfair predictions, disproportionately affecting certain groups characterized by sensitive attributes, such as race and gender.
arXiv Detail & Related papers (2024-02-02T11:26:18Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.