Random Similarity Forests
- URL: http://arxiv.org/abs/2204.05389v1
- Date: Mon, 11 Apr 2022 20:14:05 GMT
- Title: Random Similarity Forests
- Authors: Maciej Piernik, Dariusz Brzezinski, Pawel Zawadzki
- Abstract summary: We propose a classification method capable of handling datasets with features of arbitrary data types while retaining each feature's characteristic.
The proposed algorithm, called Random Similarity Forest, uses multiple domain-specific distance measures to combine the predictive performance of Random Forests with the flexibility of Similarity Forests.
We show that Random Similarity Forests are on par with Random Forests on numerical data and outperform them on datasets from complex or mixed data domains.
- Score: 2.3204178451683264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The wealth of data being gathered about humans and their surroundings drives
new machine learning applications in various fields. Consequently, more and
more often, classifiers are trained using not only numerical data but also
complex data objects. For example, multi-omics analyses attempt to combine
numerical descriptions with distributions, time series data, discrete
sequences, and graphs. Such integration of data from different domains requires
either omitting some of the data, creating separate models for different
formats, or simplifying some of the data to adhere to a shared scale and
format, all of which can hinder predictive performance. In this paper, we
propose a classification method capable of handling datasets with features of
arbitrary data types while retaining each feature's characteristic. The
proposed algorithm, called Random Similarity Forest, uses multiple
domain-specific distance measures to combine the predictive performance of
Random Forests with the flexibility of Similarity Forests. We show that Random
Similarity Forests are on par with Random Forests on numerical data and
outperform them on datasets from complex or mixed data domains. Our results
highlight the applicability of Random Similarity Forests to noisy, multi-source
datasets that are becoming ubiquitous in high-impact life science projects.
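To make the abstract's core idea concrete, here is a minimal sketch of one similarity-based split of the kind Random Similarity Forests build on: each feature keeps its native type and its own distance measure, and candidate splits are defined by pairs of reference objects projected onto one dimension. All names (`project`, `best_split`), the Jaccard distance for set-valued features, and the use of Gini impurity are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a similarity-based split with per-feature,
# domain-specific distance measures. Not the authors' code.
import random

def numeric_dist(a, b):
    # Ordinary absolute difference for numerical features.
    return abs(a - b)

def set_dist(a, b):
    # Jaccard distance for set-valued features (an assumed example
    # of a domain-specific measure).
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def project(x, p, q, dist):
    # 1-D projection used by similarity-based trees: the position of x
    # relative to two reference objects p and q under the feature's
    # distance measure. Works for any data type dist() can compare.
    return dist(x, q) - dist(x, p)

def gini(labels):
    # Gini impurity of a label list.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(column, labels, dist, n_pairs=5, rng=random):
    # Try a few random reference pairs; for each, threshold the 1-D
    # projection and keep the split with the lowest weighted impurity,
    # exactly as an ordinary decision tree would on a numeric feature.
    best = None
    for _ in range(n_pairs):
        p, q = rng.sample(column, 2)
        proj = [project(x, p, q, dist) for x in column]
        for t in sorted(set(proj))[:-1]:
            left = [y for v, y in zip(proj, labels) if v <= t]
            right = [y for v, y in zip(proj, labels) if v > t]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, p, q, t)
    return best
```

Because the split only ever calls `dist`, the same tree-growing loop can mix numerical columns, sets, sequences, or graphs, each with its own measure, which is the flexibility the abstract attributes to combining Random Forests with Similarity Forests.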
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
- Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information.
We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z)
- Federated Random Forest for Partially Overlapping Clinical Data [0.5062312533373298]
This study aims to address the challenges posed by partially overlapping features and incomplete data in clinical datasets.
For most standard algorithms, such as random forests, it is essential that all datasets share identical parameters.
For aggregating the federated, globally optimized model, only features available locally at each site can be used.
arXiv Detail & Related papers (2024-05-31T10:07:24Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performance comparable to that of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Geometry- and Accuracy-Preserving Random Forest Proximities [3.265773263570237]
We introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP).
We prove that RF-GAP proximities exactly match the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest.
This improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
arXiv Detail & Related papers (2022-01-29T23:13:53Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
- Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.