Handling Imbalanced Datasets Through Optimum-Path Forest
- URL: http://arxiv.org/abs/2202.08934v1
- Date: Thu, 17 Feb 2022 23:24:49 GMT
- Title: Handling Imbalanced Datasets Through Optimum-Path Forest
- Authors: Leandro Aparecido Passos, Danilo S. Jodas, Luiz C. F. Ribeiro, Marco
Akio, Andre Nunes de Souza, João Paulo Papa
- Abstract summary: The Optimum-Path Forest (OPF) has attracted considerable attention due to its outstanding performance over many applications.
We propose three OPF-based strategies to deal with the imbalance problem: the $\text{O}^2$PF for oversampling, the OPF-US for undersampling, and a hybrid strategy combining both.
Results compared against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the last decade, machine learning-based approaches have become
capable of performing a wide range of complex tasks, sometimes better than
humans, while demanding a fraction of the time. Such an advance is partially
due to the exponential growth in the amount of available data, which makes it
possible to extract trustworthy real-world information from it. However, such
data is generally imbalanced, since some phenomena are more likely than others.
This imbalance considerably affects a machine learning model's performance,
since the model becomes biased toward the more frequent data it receives.
Among the many machine learning methods, one graph-based approach, the
Optimum-Path Forest (OPF), has attracted considerable attention due to its
outstanding performance over many applications. In this paper, we propose
three OPF-based strategies to deal with the imbalance problem: the
$\text{O}^2$PF and the OPF-US, which are novel approaches for oversampling and
undersampling, respectively, as well as a hybrid strategy combining both
approaches. The paper also introduces a set of variants concerning the
strategies mentioned above. Results compared against several state-of-the-art
techniques over public and private datasets confirm the robustness of the
proposed approaches.
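The abstract does not detail the OPF-based resampling algorithms, so the sketch below only illustrates the general oversampling/undersampling idea they build on, using plain random resampling in NumPy. The function names and the hybrid combination are illustrative assumptions, not the paper's actual $\text{O}^2$PF or OPF-US procedures, which rely on Optimum-Path Forest clustering.

```python
import numpy as np

def random_oversample(X, y, rng=None):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_out, y_out = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_out.append(X[extra])
            y_out.append(y[extra])
    return np.vstack(X_out), np.concatenate(y_out)

def random_undersample(X, y, rng=None):
    """Drop majority-class samples until every class matches the minority count."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        keep.append(rng.choice(idx, size=target, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# A hybrid strategy (analogous in spirit to combining O2PF with OPF-US):
# undersample the majority classes first, then oversample what remains.
```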
Related papers
- Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity [55.03958223190181]
We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity.
Our results are record-setting and are confirmed by experiments on different average losses and datasets.
arXiv Detail & Related papers (2024-12-21T00:40:58Z) - Investigating the Impact of Balancing, Filtering, and Complexity on Predictive Multiplicity: A Data-Centric Perspective [5.524804393257921]
The Rashomon effect occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity.
Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques.
This paper investigates how data preprocessing techniques like balancing and filtering methods impact predictive multiplicity and model stability, considering the complexity of the data.
arXiv Detail & Related papers (2024-12-12T20:14:45Z) - Train Once for All: A Transitional Approach for Efficient Aspect Sentiment Triplet Extraction [4.372906783600122]
We propose the first transition-based model for AOPE and ASTE that performs aspect and opinion extraction jointly.
By integrating contrastive-augmented optimization, our model delivers more accurate action predictions.
Our model achieves the best performance on both ASTE and AOPE if trained on combined datasets.
arXiv Detail & Related papers (2024-11-29T19:10:41Z) - FedLF: Adaptive Logit Adjustment and Feature Optimization in Federated Long-Tailed Learning [5.23984567704876]
Federated learning offers a paradigm for addressing the challenge of preserving privacy in distributed machine learning.
Traditional approaches fail to address the phenomenon of class-wise bias in global long-tailed data.
The new method, FedLF, introduces three modifications in the local training phase: adaptive logit adjustment, continuous class-centred optimization, and feature decorrelation.
arXiv Detail & Related papers (2024-09-18T16:25:29Z) - AAA: an Adaptive Mechanism for Locally Differential Private Mean Estimation [42.95927712062214]
Local differential privacy (LDP) is a strong privacy standard that has been adopted by popular software systems.
We propose the advanced adaptive additive (AAA) mechanism, which is a distribution-aware approach that addresses the average utility.
We provide rigorous privacy proofs, utility analyses, and extensive experiments comparing AAA with state-of-the-art mechanisms.
arXiv Detail & Related papers (2024-04-02T04:22:07Z) - Efficient Hybrid Oversampling and Intelligent Undersampling for
Imbalanced Big Data Classification [1.03590082373586]
We present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework.
Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets.
arXiv Detail & Related papers (2023-10-09T15:22:13Z) - Personalized Federated Learning under Mixture of Distributions [98.25444470990107]
We propose a novel approach to Personalized Federated Learning (PFL), which utilizes Gaussian mixture models (GMM) to fit the input data distributions across diverse clients.
FedGMM possesses an additional advantage of adapting to new clients with minimal overhead, and it also enables uncertainty quantification.
Empirical evaluations on synthetic and benchmark datasets demonstrate the superior performance of our method in both PFL classification and novel sample detection.
arXiv Detail & Related papers (2023-05-01T20:04:46Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client
Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - Learning Distributionally Robust Models at Scale via Composite
Optimization [45.47760229170775]
We show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods.
We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
arXiv Detail & Related papers (2022-03-17T20:47:42Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense the dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Local Learning Matters: Rethinking Data Heterogeneity in Federated
Learning [61.488646649045215]
Federated learning (FL) is a promising strategy for performing privacy-preserving, distributed learning with a network of clients (i.e., edge devices).
arXiv Detail & Related papers (2021-11-28T19:03:39Z) - Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z) - On the Benefits of Invariance in Neural Networks [56.362579457990094]
We show that training with data augmentation leads to better estimates of risk and thereof gradients, and we provide a PAC-Bayes generalization bound for models trained with data augmentation.
We also show that compared to data augmentation, feature averaging reduces generalization error when used with convex losses, and tightens PAC-Bayes bounds.
arXiv Detail & Related papers (2020-05-01T02:08:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.