Related papers: Asymptotic Normality of Infinite Centered Random Forests -Application to Imbalanced Classification

URL: http://arxiv.org/abs/2506.08548v1
Date: Tue, 10 Jun 2025 08:14:28 GMT
Title: Asymptotic Normality of Infinite Centered Random Forests -Application to Imbalanced Classification
Authors: Moria Mayala, Erwan Scornet, Charles Tillier, Olivier Wintenberger
Abstract summary: In this paper, we study theoretically such a procedure, when the classifier is a Centered Random Forest (CRF). We prove that the CRF trained on the rebalanced dataset exhibits a bias, which can be removed with appropriate techniques. For high imbalance settings, we prove that the IS-ICRF estimator enjoys a variance reduction compared to the ICRF trained on the original data.
Score: 6.5160087003642
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many classification tasks involve imbalanced data, in which a class is largely underrepresented. Several techniques consist of creating a rebalanced dataset on which a classifier is trained. In this paper, we study theoretically such a procedure, when the classifier is a Centered Random Forest (CRF). We establish a Central Limit Theorem (CLT) on the infinite CRF with explicit rates and exact constant. We then prove that the CRF trained on the rebalanced dataset exhibits a bias, which can be removed with appropriate techniques. Based on an importance sampling (IS) approach, the resulting debiased estimator, called IS-ICRF, satisfies a CLT centered at the prediction function value. For high imbalance settings, we prove that the IS-ICRF estimator enjoys a variance reduction compared to the ICRF trained on the original data. Therefore, our theoretical analysis highlights the benefits of training random forests on a rebalanced dataset (followed by a debiasing procedure) compared to using the original data. Our theoretical results, especially the variance rates and the variance reduction, appear to be valid for Breiman's random forests in our experiments.
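Illustration: the abstract states that a classifier trained on a rebalanced dataset is biased and that the bias can be removed. The sketch below shows one well-known way such a bias arises and is undone, assuming the rebalanced set was built by randomly keeping each majority-class (label 0) sample with probability `keep_rate` and keeping every minority-class sample. It is the standard prior-shift (odds) correction, not the paper's IS-ICRF estimator; the function name and parameters are hypothetical.

```python
def debias_undersampled_probability(p_rebalanced: float, keep_rate: float) -> float:
    """Map an estimate of P(y=1 | x) learned on undersampled data back to
    the original class balance.

    Assumption (not from the paper): each majority-class (y=0) sample was
    kept independently with probability `keep_rate`, and all minority-class
    (y=1) samples were kept. Undersampling then multiplies the odds of
    class 1 by 1 / keep_rate, so the correction multiplies them by keep_rate.
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_rebalanced <= 1.0:
        raise ValueError("p_rebalanced must be in [0, 1]")
    numerator = keep_rate * p_rebalanced
    return numerator / (numerator + (1.0 - p_rebalanced))


if __name__ == "__main__":
    # A score of 0.5 on data where only 10% of the majority class was kept
    # corresponds to a much smaller probability on the original data.
    print(debias_undersampled_probability(0.5, keep_rate=0.1))
```

With `keep_rate=1.0` (no undersampling) the function returns its input unchanged, which is a quick sanity check on the formula.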

Related papers

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing [85.85160896547698]
Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks. We show how to design an efficient classifier with a certified radius by relying on noise injection into the inputs. Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
arXiv Detail & Related papers (2023-09-28T22:41:47Z)
Chasing Fairness Under Distribution Shift: A Model Weight Perturbation Approach [72.19525160912943]
We first theoretically demonstrate the inherent connection between distribution shift, data perturbation, and model weight perturbation. We then analyze the sufficient conditions to guarantee fairness for the target dataset. Motivated by these sufficient conditions, we propose robust fairness regularization (RFR).
arXiv Detail & Related papers (2023-03-06T17:19:23Z)
Proposal Distribution Calibration for Few-Shot Object Detection [65.19808035019031]
In few-shot object detection (FSOD), the two-step training paradigm is widely adopted to mitigate the severe sample imbalance. Unfortunately, the extreme data scarcity aggravates the proposal distribution bias, hindering the RoI head from evolving toward novel classes. We introduce a simple yet effective proposal distribution calibration (PDC) approach to neatly enhance the localization and classification abilities of the RoI head.
arXiv Detail & Related papers (2022-12-15T05:09:11Z)
Adaptive Dimension Reduction and Variational Inference for Transductive Few-Shot Classification [2.922007656878633]
We propose a new clustering method based on Variational Bayesian inference, further improved by Adaptive Dimension Reduction. Our proposed method significantly improves accuracy in the realistic unbalanced transductive setting on various Few-Shot benchmarks.
arXiv Detail & Related papers (2022-09-18T10:29:02Z)
Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within PAC-Bayes risk certification paradigm. Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z)
Cross-Domain Empirical Risk Minimization for Unbiased Long-tailed Classification [90.17537630880305]
We address the overlooked unbiasedness in existing long-tailed classification methods. We propose Cross-Domain Empirical Risk Minimization (xERM) for training an unbiased model.
arXiv Detail & Related papers (2021-12-29T03:18:47Z)
Robust Neural Network Classification via Double Regularization [2.41710192205034]
We propose a novel double regularization of the neural network training loss that combines a penalty on the complexity of the classification model and an optimal reweighting of training observations. We demonstrate DRFit, for neural net classification of (i) MNIST and (ii) CIFAR-10, in both cases with simulated mislabeling.
arXiv Detail & Related papers (2021-12-15T13:19:20Z)
Bayesian analysis of the prevalence bias: learning and predicting from imbalanced data [10.659348599372944]
This paper lays the theoretical and computational framework for training models, and for prediction, in the presence of prevalence bias. It offers an alternative to principled training losses and complements test-time procedures based on selecting an operating point from summary curves. It integrates seamlessly in the current paradigm of (deep) learning using backpropagation and naturally with Bayesian models.
arXiv Detail & Related papers (2021-07-31T14:36:33Z)
RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification [5.448684866061922]
Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. Our results show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
arXiv Detail & Related papers (2021-05-09T19:47:45Z)
Estimation and Applications of Quantiles in Deep Binary Classification [0.0]
Quantile regression, based on check loss, is a widely used inferential paradigm in Statistics. We consider the analogue of check loss in the binary classification setting. We develop individualized confidence scores that can be used to decide whether a prediction is reliable.
arXiv Detail & Related papers (2021-02-09T07:07:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.