A Numerical Transform of Random Forest Regressors corrects
Systematically-Biased Predictions
- URL: http://arxiv.org/abs/2003.07445v1
- Date: Mon, 16 Mar 2020 21:18:06 GMT
- Title: A Numerical Transform of Random Forest Regressors corrects
Systematically-Biased Predictions
- Authors: Shipra Malhotra and John Karanicolas
- Abstract summary: We find a systematic bias in predictions from random forest models.
This bias is recapitulated in simple synthetic datasets.
We use the training data to define a numerical transformation that fully corrects it.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past decade, random forest models have become widely used as a
robust method for high-dimensional data regression tasks. In part, the
popularity of these models arises from the fact that they require little
hyperparameter tuning and are not very susceptible to overfitting. Random
forest regression models are composed of an ensemble of decision trees that
independently predict the value of a (continuous) dependent variable;
predictions from each of the trees are ultimately averaged to yield an overall
predicted value from the forest. Using a suite of representative real-world
datasets, we find a systematic bias in predictions from random forest models.
We find that this bias is recapitulated in simple synthetic datasets,
regardless of whether or not they include irreducible error (noise) in the
data, but that models employing boosting do not exhibit this bias. Here we
demonstrate the basis for this problem, and we use the training data to define
a numerical transformation that fully corrects it. Application of this
transformation yields improved predictions in every one of the real-world and
synthetic datasets evaluated in our study.
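The correction lends itself to a brief illustration. Below is a minimal sketch, assuming the transform can be realized as a monotone map from the forest's raw predictions to the observed targets, fitted on out-of-bag training predictions with isotonic regression; the paper defines its own numerical transformation, which may differ in detail.

```python
# Minimal sketch of a training-data-derived bias correction for random
# forest regression. The isotonic mapping is an assumption made here for
# illustration; the paper specifies its own numerical transformation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_prediction_ gives each training point a prediction from the trees
# that never saw it, so the correction is not fit on optimistic in-sample output.
forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Learn a monotone map from raw out-of-bag predictions to the true targets.
correction = IsotonicRegression(out_of_bounds="clip")
correction.fit(forest.oob_prediction_, y_train)

raw = forest.predict(X_test)
corrected = correction.predict(raw)
print("raw MSE:      ", np.mean((raw - y_test) ** 2))
print("corrected MSE:", np.mean((corrected - y_test) ** 2))
```

Random forest averaging tends to pull predictions toward the center of the training distribution, which is the shape of error a monotone recalibration of this kind can address.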
Related papers
- Generalized Regression with Conditional GANs [2.4171019220503402]
We propose to learn a prediction function whose outputs, when paired with the corresponding inputs, are indistinguishable from feature-label pairs in the training dataset.
We show that this approach to regression makes fewer assumptions on the distribution of the data we are fitting to and, therefore, has better representation capabilities.
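As a rough illustration of that objective, here is a minimal PyTorch sketch of our own construction (not the paper's code): a generator G(x, z) produces labels, and a discriminator is trained to tell real (x, y) pairs from generated ones.

```python
# Minimal sketch of conditional-GAN regression: make (x, G(x, z)) pairs
# indistinguishable from real (x, y) pairs. Architectures and training
# details here are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(512, 1) * 4 - 2
y = x.pow(2) + 0.1 * torch.randn_like(x)   # toy data: y = x^2 + noise

G = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    z = torch.randn(x.size(0), 1)
    y_fake = G(torch.cat([x, z], dim=1))
    # Discriminator: separate real (x, y) pairs from generated pairs.
    d_loss = (bce(D(torch.cat([x, y], 1)), torch.ones_like(y))
              + bce(D(torch.cat([x, y_fake.detach()], 1)), torch.zeros_like(y)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: make generated pairs look real to the discriminator.
    g_loss = bce(D(torch.cat([x, y_fake], 1)), torch.ones_like(y))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# A point prediction can be read off by averaging over the noise input.
with torch.no_grad():
    x_test = torch.full((100, 1), 1.5)
    samples = G(torch.cat([x_test, torch.randn(100, 1)], 1))
    print("E[y | x=1.5] ~", samples.mean().item())
```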
arXiv Detail & Related papers (2024-04-21T01:27:47Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
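A toy version of such a self-consuming loop is easy to reproduce with kernel density estimation; the setup below is our own illustration, not the paper's analysis.

```python
# Minimal sketch of a self-consuming KDE loop: each generation is fitted
# to samples drawn from the previous generation's model.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)   # real data, std = 1

for generation in range(5):
    kde = gaussian_kde(data)
    # Fully synthetic retraining; mixing real data back in would slow the drift.
    data = kde.resample(size=1000, seed=generation).ravel()
    print(f"generation {generation}: sample std = {data.std():.3f}")
# The std creeps upward: each resampling convolves in extra bandwidth noise.
```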
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Forest Parameter Prediction by Multiobjective Deep Learning of Regression Models Trained with Pseudo-Target Imputation [6.853936752111048]
In prediction of forest parameters with data from remote sensing, regression models have traditionally been trained on a small sample of ground reference data.
This paper proposes to impute this sample of true prediction targets with data from an existing RS-based prediction map that we consider as pseudo-targets.
We use prediction maps constructed from airborne laser scanning (ALS) data to provide accurate pseudo-targets and free data from Sentinel-1's C-band synthetic aperture radar (SAR) as regressors.
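A minimal sketch of the pseudo-target idea, with stand-in models and an assumed down-weighting of pseudo-targets (the paper's multiobjective deep learning setup is more elaborate):

```python
# Minimal sketch: augment a small ground-reference sample with pseudo-targets
# read off an existing prediction map, then train on the union. The weight
# of 0.1 on pseudo-targets is an illustrative assumption.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=2000)

n_ref = 50                                  # small ground-reference sample
X_ref, y_ref = X[:n_ref], y[:n_ref]
X_pool = X[n_ref:]                          # large pool without reference data

prior_map = Ridge().fit(X_ref, y_ref)       # stand-in for the ALS-based map
y_pseudo = prior_map.predict(X_pool)        # pseudo-targets for the pool

X_all = np.vstack([X_ref, X_pool])
y_all = np.concatenate([y_ref, y_pseudo])
w_all = np.concatenate([np.ones(n_ref), np.full(len(y_pseudo), 0.1)])
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_all, y_all, sample_weight=w_all)
print(model.score(X_all, y_all))
```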
arXiv Detail & Related papers (2023-06-19T18:10:47Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
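In the spirit of that ensemble, here is a minimal sketch using bootstrapped Gaussian mixtures as stand-in generative models (DGE itself is built on deep generative models):

```python
# Minimal sketch of a generative ensemble: train several generative models
# on bootstrap resamples and keep their synthetic sets separate, so that
# downstream results can be aggregated across the ensemble.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])

synthetic_sets = []
for k in range(5):
    boot = real[rng.integers(0, len(real), size=len(real))]
    gm = GaussianMixture(n_components=3, random_state=k).fit(boot)
    samples, _ = gm.sample(500)
    synthetic_sets.append(samples)

# Train one downstream model per member and aggregate their metrics,
# rather than trusting any single synthetic dataset.
print(len(synthetic_sets), synthetic_sets[0].shape)  # 5 members of (500, 2)
```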
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Posterior Collapse and Latent Variable Non-identifiability [54.842098835445]
We propose a class of latent-identifiable variational autoencoders, deep generative models which enforce identifiability without sacrificing flexibility.
Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.
arXiv Detail & Related papers (2023-01-02T06:16:56Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To get the best of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
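One plausible reading of that minimax game, sketched with two regression heads whose disagreement the heads maximize and the feature extractor minimizes (an assumption on our part, not the paper's exact objective):

```python
# Minimal sketch of a feature-extractor-vs-heads minimax game for regression.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 10)
y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)

extractor = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
head_a, head_b = nn.Linear(32, 1), nn.Linear(32, 1)
opt_f = torch.optim.Adam(extractor.parameters(), lr=1e-3)
opt_h = torch.optim.Adam(
    list(head_a.parameters()) + list(head_b.parameters()), lr=1e-3)
mse = nn.MSELoss()

for step in range(500):
    # Heads: fit the task while maximizing their mutual discrepancy.
    feats = extractor(x).detach()
    pa, pb = head_a(feats), head_b(feats)
    h_loss = mse(pa, y) + mse(pb, y) - mse(pa, pb)
    opt_h.zero_grad(); h_loss.backward(); opt_h.step()

    # Extractor: fit the task while making the heads agree.
    feats = extractor(x)
    pa, pb = head_a(feats), head_b(feats)
    f_loss = mse(pa, y) + mse(pb, y) + mse(pa, pb)
    opt_f.zero_grad(); f_loss.backward(); opt_f.step()
```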
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem [1.5749416770494704]
A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data.
Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error.
We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest.
arXiv Detail & Related papers (2020-12-19T21:48:23Z)
- Deep transformation models: Tackling complex regression problems with neural network based transformation models [0.0]
We present a deep transformation model for probabilistic regression.
It estimates the whole conditional probability distribution, which is the most thorough way to capture uncertainty about the outcome.
Our method works for complex input data, which we demonstrate by employing a CNN architecture on image data.
arXiv Detail & Related papers (2020-04-01T14:23:12Z)
- Censored Quantile Regression Forest [81.9098291337097]
We develop a new estimating equation that adapts to censoring and reduces to the ordinary quantile score whenever the data do not exhibit censoring.
The proposed procedure, named censored quantile regression forest, allows us to estimate quantiles of time-to-event outcomes without any parametric modeling assumption.
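For context, here is a minimal sketch of the uncensored quantile-regression-forest idea the method builds on: pool the training targets that share leaves with a query point and read off empirical quantiles. This simplifies the weighted estimator and omits the censoring adaptation entirely.

```python
# Minimal sketch of quantile estimation from a random forest's leaves.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1000, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=1000)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                               random_state=0).fit(X, y)

def leaf_quantiles(x_query, qs=(0.1, 0.5, 0.9)):
    """Empirical quantiles of training targets sharing leaves with x_query."""
    train_leaves = forest.apply(X)             # (n_train, n_trees)
    query_leaves = forest.apply(x_query)[0]    # (n_trees,)
    pooled = np.concatenate([y[train_leaves[:, t] == query_leaves[t]]
                             for t in range(train_leaves.shape[1])])
    return np.quantile(pooled, qs)

print(leaf_quantiles(np.array([[1.0]])))   # ~ quantiles of y given x = 1.0
```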
arXiv Detail & Related papers (2020-01-08T23:20:23Z)
- Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which make it possible to handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
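The Fréchet mean that underlies such metric-space methods can be sketched in a few lines; restricting the minimizer to the observed sample (the medoid) is a simplification for illustration.

```python
# Minimal sketch of a sample Fréchet mean under an arbitrary metric.
import numpy as np

def frechet_mean(points, dist):
    """Point of the sample minimizing the sum of squared distances (medoid)."""
    costs = [sum(dist(p, q) ** 2 for q in points) for p in points]
    return points[int(np.argmin(costs))]

def circle_dist(a, b):
    """Geodesic distance between angles on the unit circle."""
    d = abs(a - b) % (2 * np.pi)
    return min(d, 2 * np.pi - d)

angles = [0.1, 0.3, 6.2, 0.2, 6.1]   # one cluster straddling the 0/2*pi cut
print(frechet_mean(angles, circle_dist))   # near 0, unlike np.mean(angles)
```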
arXiv Detail & Related papers (2019-06-04T22:07:24Z)