Random Similarity Forests
- URL: http://arxiv.org/abs/2204.05389v1
- Date: Mon, 11 Apr 2022 20:14:05 GMT
- Title: Random Similarity Forests
- Authors: Maciej Piernik, Dariusz Brzezinski, Pawel Zawadzki
- Abstract summary: We propose a classification method capable of handling datasets with features of arbitrary data types while retaining each feature's characteristics.
The proposed algorithm, called Random Similarity Forest, uses multiple domain-specific distance measures to combine the predictive performance of Random Forests with the flexibility of Similarity Forests.
We show that Random Similarity Forests are on par with Random Forests on numerical data and outperform them on datasets from complex or mixed data domains.
- Score: 2.3204178451683264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The wealth of data being gathered about humans and their surroundings drives
new machine learning applications in various fields. Consequently, more and
more often, classifiers are trained using not only numerical data but also
complex data objects. For example, multi-omics analyses attempt to combine
numerical descriptions with distributions, time series data, discrete
sequences, and graphs. Such integration of data from different domains requires
either omitting some of the data, creating separate models for different
formats, or simplifying some of the data to adhere to a shared scale and
format, all of which can hinder predictive performance. In this paper, we
propose a classification method capable of handling datasets with features of
arbitrary data types while retaining each feature's characteristics. The
proposed algorithm, called Random Similarity Forest, uses multiple
domain-specific distance measures to combine the predictive performance of
Random Forests with the flexibility of Similarity Forests. We show that Random
Similarity Forests are on par with Random Forests on numerical data and
outperform them on datasets from complex or mixed data domains. Our results
highlight the applicability of Random Similarity Forests to noisy, multi-source
datasets that are becoming ubiquitous in high-impact life science projects.
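The core mechanism the abstract describes, splitting on distances to exemplar objects rather than on raw feature values, can be illustrated with a short sketch. This is not the authors' code: it shows the exemplar-based split used in Similarity Forests, which Random Similarity Forests apply per feature with a feature-specific distance measure; all function names and the exemplar-selection details are illustrative assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def similarity_split(values, labels, dist, rng=random):
    """Pick one exemplar from each of two classes, project every object
    onto the 1-D axis dist(x, p) - dist(x, q), and return the threshold
    minimizing weighted Gini impurity. `dist` can be any domain-specific
    distance (edit distance for sequences, Wasserstein for distributions, ...)."""
    classes = sorted(set(labels))
    p = rng.choice([v for v, y in zip(values, labels) if y == classes[0]])
    q = rng.choice([v for v, y in zip(values, labels) if y == classes[-1]])
    proj = [dist(x, p) - dist(x, q) for x in values]
    order = sorted(range(len(proj)), key=lambda i: proj[i])
    best_score, best_thr = float("inf"), None
    for k in range(1, len(order)):
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score = score
            best_thr = (proj[order[k - 1]] + proj[order[k]]) / 2
    return best_score, best_thr

# Toy example: a numeric feature with absolute difference as the distance.
vals = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labs = [0, 0, 0, 1, 1, 1]
impurity, thr = similarity_split(vals, labs, lambda a, b: abs(a - b),
                                 rng=random.Random(0))
```

Because the split only ever consumes pairwise distances, the same routine works unchanged whether the feature column holds numbers, sequences, or graphs.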
Related papers
- Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information.
We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with classical results from causal inference yields an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Combining Observational and Randomized Data for Estimating Heterogeneous
Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z) - Geometry- and Accuracy-Preserving Random Forest Proximities [3.265773263570237]
We introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP)
We prove that RF-GAP proximities exactly match the out-of-bag random forest predictions, thus capturing the data geometry learned by the random forest.
This improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
arXiv Detail & Related papers (2022-01-29T23:13:53Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Cross-Cluster Weighted Forests [2.099922236065961]
This article considers the effect of ensembling Random Forest learners trained on clusters within a single dataset with heterogeneity in the distribution of the features.
We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
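The cluster-then-ensemble idea summarized above can be sketched in a few lines. This is a dependency-free illustration, not the paper's implementation: a plain 1-D k-means stands in for the clustering step, per-cluster majority-class predictors stand in for full Random Forests, and the inverse-distance vote weighting is an assumption chosen for the sketch.

```python
import random

def kmeans_1d(xs, k, iters=20, rng=random):
    """Plain Lloyd's algorithm on 1-D data."""
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def assign(x, centers):
    """Index of the nearest cluster centroid."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def fit_cluster_models(xs, ys, centers):
    """Train one learner per cluster; a majority-class predictor
    stands in for a Random Forest here (assumes no empty clusters)."""
    models = []
    for j in range(len(centers)):
        labs = [y for x, y in zip(xs, ys) if assign(x, centers) == j]
        models.append(max(set(labs), key=labs.count))
    return models

def predict(x, centers, models):
    """Cross-cluster ensembling: every cluster's learner votes,
    weighted by inverse distance to that cluster's centroid."""
    votes = {}
    for c, m in zip(centers, models):
        votes[m] = votes.get(m, 0.0) + 1.0 / (abs(x - c) + 1e-9)
    return max(votes, key=votes.get)

# Two well-separated feature clusters with distinct labels:
xs = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
ys = [0, 0, 0, 1, 1, 1]
centers = kmeans_1d(xs, 2, rng=random.Random(1))
models = fit_cluster_models(xs, ys, centers)
```

The key design point is that every cluster-specific learner contributes to every prediction; the weighting merely shifts influence toward the learner trained on the most similar region of feature space.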
arXiv Detail & Related papers (2021-05-17T04:58:29Z) - DAIL: Dataset-Aware and Invariant Learning for Face Recognition [67.4903809903022]
To achieve good performance in face recognition, a large scale training dataset is usually required.
Naively combining different datasets is problematic due to two major issues.
First, treating the same person as different classes in different datasets during training will affect back-propagation.
Second, manually cleaning labels may take formidable human effort, especially when there are millions of images and thousands of identities.
arXiv Detail & Related papers (2021-01-14T01:59:52Z) - Tell Me Something I Don't Know: Randomization Strategies for Iterative
Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z) - Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.