Geometry- and Accuracy-Preserving Random Forest Proximities
- URL: http://arxiv.org/abs/2201.12682v1
- Date: Sat, 29 Jan 2022 23:13:53 GMT
- Title: Geometry- and Accuracy-Preserving Random Forest Proximities
- Authors: Jake S. Rhodes, Adele Cutler, Kevin R. Moon
- Abstract summary: We introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP).
We prove that proximity-weighted predictions using RF-GAP exactly match the out-of-bag random forest predictions, thus capturing the data geometry learned by the random forest.
This improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
- Score: 3.265773263570237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Random forests are considered one of the best out-of-the-box classification
and regression algorithms due to their high level of predictive performance
with relatively little tuning. Pairwise proximities that measure the similarity
between data points relative to the supervised task can be computed from a
trained random forest. Random forest proximities have been used in many
applications including the identification of variable importance, data
imputation, outlier detection, and data visualization. However, existing
definitions of random forest proximities do not accurately reflect the data
geometry learned by the random forest. In this paper, we introduce a novel
definition of random forest proximities called Random Forest-Geometry- and
Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted
sum (regression) or majority vote (classification) using RF-GAP exactly matches
the out-of-bag random forest prediction, thus capturing the data geometry
learned by the random forest. We empirically show that this improved geometric
representation outperforms traditional random forest proximities in tasks such
as data imputation and provides outlier detection and visualization results
consistent with the learned data geometry.
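For context, the following sketch (ours, not the paper's implementation) computes the traditional Breiman-style proximity from a trained scikit-learn forest, namely the fraction of trees in which two observations share a terminal node, and forms the corresponding proximity-weighted majority vote; in the regression case the analogous quantity is the proximity-weighted sum of training responses. RF-GAP is instead defined through out-of-bag co-occurrence so that this weighted prediction reproduces the out-of-bag forest prediction exactly, whereas with the traditional definition the agreement is only approximate. The library, dataset, and parameter choices below are illustrative assumptions.

```python
# Minimal sketch (not the paper's RF-GAP definition): the traditional
# Breiman-style proximity (fraction of trees in which two observations share a
# terminal node) and the corresponding proximity-weighted majority vote.
# RF-GAP instead builds the proximity from out-of-bag co-occurrence so that
# this weighted vote reproduces the out-of-bag prediction exactly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

# leaves[i, t] = index of the terminal node reached by observation i in tree t.
leaves = rf.apply(X)                       # shape (n_samples, n_trees)
n_samples, n_trees = leaves.shape

# Traditional proximity: fraction of trees in which i and j land in the same leaf.
prox = np.zeros((n_samples, n_samples))
for t in range(n_trees):
    prox += leaves[:, t][:, None] == leaves[:, t][None, :]
prox /= n_trees

# Proximity-weighted majority vote over the training responses, excluding each
# point's self-proximity. With RF-GAP proximities this vote equals the
# out-of-bag forest prediction; with traditional proximities the agreement is
# only approximate.
np.fill_diagonal(prox, 0.0)
votes = np.stack([prox[:, y == c].sum(axis=1) for c in rf.classes_], axis=1)
prox_pred = rf.classes_[votes.argmax(axis=1)]

oob_pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]
print(f"agreement with OOB predictions: {np.mean(prox_pred == oob_pred):.3f}")
```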
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Global Censored Quantile Random Forest [2.8413279736755017]
We propose a Global Censored Quantile Random Forest (GCQRF) for predicting a conditional quantile process on data subject to right censoring.
We quantify the prediction process' variation without assuming an infinite forest and establish its weak convergence.
We demonstrate the superior predictive accuracy of the proposed method over a number of existing alternatives.
arXiv Detail & Related papers (2024-10-16T04:05:01Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques; a minimal illustrative sketch of this BMU-labeling recipe follows this entry.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
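The entry above describes a two-stage recipe: train a SOM on unlabeled data, then attach a small number of revealed labels to their best matching units and label the remaining points through the map topology. The sketch below illustrates that general idea only; it is not the authors' implementation, and the MiniSom package, the digits dataset, the grid size, the number of revealed labels, and the nearest-BMU labeling rule are all assumed placeholder choices.

```python
# Illustrative sketch of minimally supervised labeling via SOM best matching
# units (BMUs), using the third-party MiniSom package. Grid size, number of
# revealed labels, and the nearest-BMU rule are placeholders, not the paper's.
import numpy as np
from minisom import MiniSom
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]

# 1) Train the SOM on the data as if it were unlabeled.
som = MiniSom(15, 15, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 5000)

# 2) Reveal labels for a small subset and attach them to their BMUs
#    (if two revealed points share a BMU, the last one wins in this sketch).
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(X), size=50, replace=False)
bmu_labels = {som.winner(X[i]): y[i] for i in labeled_idx}

# 3) Label every remaining point with the label of the nearest labeled BMU
#    on the SOM grid (a simple topological-projection rule).
labeled_units = np.array(list(bmu_labels.keys()))
unit_labels = np.array(list(bmu_labels.values()))

def predict(x):
    u = np.array(som.winner(x))
    nearest = np.argmin(np.linalg.norm(labeled_units - u, axis=1))
    return unit_labels[nearest]

preds = np.array([predict(x) for x in X])
print("accuracy with 50 revealed labels:", np.mean(preds == y))
```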
- Neuroevolution-based Classifiers for Deforestation Detection in Tropical Forests [62.997667081978825]
Millions of hectares of tropical forests are lost every year due to deforestation or degradation.
Monitoring and deforestation detection programs are in use, alongside public policies aimed at preventing and punishing offenders.
This paper proposes the use of pattern classifiers based on the NeuroEvolution of Augmenting Topologies (NEAT) technique for tropical forest deforestation detection tasks.
arXiv Detail & Related papers (2022-08-23T16:04:12Z)
- Random Similarity Forests [2.3204178451683264]
We propose a classification method capable of handling datasets with features of arbitrary data types while retaining each feature's characteristics.
The proposed algorithm, called Random Similarity Forest, uses multiple domain-specific distance measures to combine the predictive performance of Random Forests with the flexibility of Similarity Forests.
We show that Random Similarity Forests are on par with Random Forests on numerical data and outperform them on datasets from complex or mixed data domains.
arXiv Detail & Related papers (2022-04-11T20:14:05Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- Minimax Rates for High-Dimensional Random Tessellation Forests [0.0]
Mondrian forests are the first class of random forests for which minimax rates were obtained in arbitrary dimension.
We show that a large class of random forests with general split directions also achieve minimax optimal convergence rates in arbitrary dimension.
arXiv Detail & Related papers (2021-09-22T06:47:38Z)
- Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression [51.770998056563094]
Probabilistic Gradient Boosting Machines (PGBM) is a method to create probabilistic predictions with a single ensemble of decision trees.
We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-03T08:32:13Z)
- Improved Weighted Random Forest for Classification Problems [3.42658286826597]
The key to building a well-performing ensemble model lies in the diversity of the base models.
We propose several algorithms that modify the weighting strategy of the regular random forest.
The proposed models achieve significant improvements over the regular random forest.
arXiv Detail & Related papers (2020-09-01T16:08:45Z)
- Censored Quantile Regression Forest [81.9098291337097]
We develop a new estimating equation that adapts to censoring and leads to the quantile score whenever the data do not exhibit censoring.
The proposed procedure, named censored quantile regression forest, allows us to estimate quantiles of time-to-event outcomes without any parametric modeling assumption.
arXiv Detail & Related papers (2020-01-08T23:20:23Z)
- Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees (an illustrative sketch of the underlying Fréchet mean follows this entry).
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
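The building block that lets tree-based predictors handle metric-space-valued responses is the Fréchet mean, the point of the output space minimizing the sum of squared distances to the observed responses; this is our reading of the general construction, while the paper's precise predictor is the Fréchet regressogram mentioned above. The toy sketch below computes a Fréchet mean on the unit circle with geodesic distance; the example and the brute-force grid minimizer are illustrative assumptions.

```python
# Toy illustration of a Frechet mean, the generalization of the average that
# tree-based predictors can use when responses live in a general metric space.
# Here the metric space is the unit circle with geodesic (arc-length) distance;
# the example and the grid-search minimizer are illustrative choices, not the
# paper's construction.
import numpy as np

def geodesic_dist(a, b):
    """Arc-length distance between two angles on the unit circle."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def frechet_mean(angles, grid_size=10_000):
    """Brute-force Frechet mean: the candidate angle minimizing the sum of
    squared geodesic distances to the observed angles."""
    candidates = np.linspace(0.0, 2 * np.pi, grid_size, endpoint=False)
    costs = np.array([np.sum(geodesic_dist(c, angles) ** 2) for c in candidates])
    return candidates[np.argmin(costs)]

# Responses clustered around the wrap-around point 0 == 2*pi, where a naive
# arithmetic mean lands far from the data.
angles = np.array([0.1, 6.2, 0.2, 6.1, 0.05])
print("arithmetic mean:", np.mean(angles))        # about 2.53, far from the cluster
print("Frechet mean:   ", frechet_mean(angles))   # near 0 (mod 2*pi), inside the cluster
```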