Random Similarity Forests
- URL: http://arxiv.org/abs/2204.05389v1
- Date: Mon, 11 Apr 2022 20:14:05 GMT
- Title: Random Similarity Forests
- Authors: Maciej Piernik, Dariusz Brzezinski, Pawel Zawadzki
- Abstract summary: We propose a classification method capable of handling datasets with features of arbitrary data types while retaining each feature's characteristics.
The proposed algorithm, called Random Similarity Forest, uses multiple domain-specific distance measures to combine the predictive performance of Random Forests with the flexibility of Similarity Forests.
We show that Random Similarity Forests are on par with Random Forests on numerical data and outperform them on datasets from complex or mixed data domains.
- Score: 2.3204178451683264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The wealth of data being gathered about humans and their surroundings drives
new machine learning applications in various fields. Consequently, more and
more often, classifiers are trained using not only numerical data but also
complex data objects. For example, multi-omics analyses attempt to combine
numerical descriptions with distributions, time series data, discrete
sequences, and graphs. Such integration of data from different domains requires
either omitting some of the data, creating separate models for different
formats, or simplifying some of the data to adhere to a shared scale and
format, all of which can hinder predictive performance. In this paper, we
propose a classification method capable of handling datasets with features of
arbitrary data types while retaining each feature's characteristics. The
proposed algorithm, called Random Similarity Forest, uses multiple
domain-specific distance measures to combine the predictive performance of
Random Forests with the flexibility of Similarity Forests. We show that Random
Similarity Forests are on par with Random Forests on numerical data and
outperform them on datasets from complex or mixed data domains. Our results
highlight the applicability of Random Similarity Forests to noisy, multi-source
datasets that are becoming ubiquitous in high-impact life science projects.
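The core mechanism the abstract describes, splitting on distances to exemplar objects rather than on raw feature values, can be illustrated with a short sketch. This is not the authors' code: it shows the exemplar-based split used in Similarity Forests, which Random Similarity Forests apply per feature with a feature-specific distance measure; all function names and the exemplar-selection details are illustrative assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def similarity_split(values, labels, dist, rng=random):
    """Pick one exemplar from each of two classes, project every object
    onto the 1-D axis dist(x, p) - dist(x, q), and return the threshold
    minimizing weighted Gini impurity. `dist` can be any domain-specific
    distance (edit distance for sequences, Wasserstein for distributions, ...)."""
    classes = sorted(set(labels))
    p = rng.choice([v for v, y in zip(values, labels) if y == classes[0]])
    q = rng.choice([v for v, y in zip(values, labels) if y == classes[-1]])
    proj = [dist(x, p) - dist(x, q) for x in values]
    order = sorted(range(len(proj)), key=lambda i: proj[i])
    best_score, best_thr = float("inf"), None
    for k in range(1, len(order)):
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_score = score
            best_thr = (proj[order[k - 1]] + proj[order[k]]) / 2
    return best_score, best_thr

# Toy example: a numeric feature with absolute difference as the distance.
vals = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labs = [0, 0, 0, 1, 1, 1]
impurity, thr = similarity_split(vals, labs, lambda a, b: abs(a - b),
                                 rng=random.Random(0))
```

Because the split only ever consumes pairwise distances, the same routine works unchanged whether the feature column holds numbers, sequences, or graphs.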
Related papers
- Flexible inference in heterogeneous and attributed multilayer networks [21.349513661012498]
We develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information.
We demonstrate its ability to unveil a variety of patterns in a social support network among villagers in rural India.
arXiv Detail & Related papers (2024-05-31T15:21:59Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how combining recent results on equivariant representation learning over structured spaces with classical results from causal inference yields an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Combining Observational and Randomized Data for Estimating Heterogeneous
Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z) - Geometry- and Accuracy-Preserving Random Forest Proximities [3.265773263570237]
We introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP)
We prove that RF-GAP proximities exactly match the out-of-bag random forest predictions, thus capturing the data geometry learned by the random forest.
This improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
arXiv Detail & Related papers (2022-01-29T23:13:53Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Cross-Cluster Weighted Forests [2.099922236065961]
This article considers the effect of ensembling Random Forest learners trained on clusters within a single dataset with heterogeneity in the distribution of the features.
We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
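The cluster-then-ensemble idea summarized above can be sketched in a few lines. This is a dependency-free illustration, not the paper's implementation: a plain 1-D k-means stands in for the clustering step, per-cluster majority-class predictors stand in for full Random Forests, and the inverse-distance vote weighting is an assumption chosen for the sketch.

```python
import random

def kmeans_1d(xs, k, iters=20, rng=random):
    """Plain Lloyd's algorithm on 1-D data."""
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def assign(x, centers):
    """Index of the nearest cluster centroid."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def fit_cluster_models(xs, ys, centers):
    """Train one learner per cluster; a majority-class predictor
    stands in for a Random Forest here (assumes no empty clusters)."""
    models = []
    for j in range(len(centers)):
        labs = [y for x, y in zip(xs, ys) if assign(x, centers) == j]
        models.append(max(set(labs), key=labs.count))
    return models

def predict(x, centers, models):
    """Cross-cluster ensembling: every cluster's learner votes,
    weighted by inverse distance to that cluster's centroid."""
    votes = {}
    for c, m in zip(centers, models):
        votes[m] = votes.get(m, 0.0) + 1.0 / (abs(x - c) + 1e-9)
    return max(votes, key=votes.get)

# Two well-separated feature clusters with distinct labels:
xs = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
ys = [0, 0, 0, 1, 1, 1]
centers = kmeans_1d(xs, 2, rng=random.Random(1))
models = fit_cluster_models(xs, ys, centers)
```

The key design point is that every cluster-specific learner contributes to every prediction; the weighting merely shifts influence toward the learner trained on the most similar region of feature space.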
arXiv Detail & Related papers (2021-05-17T04:58:29Z) - DAIL: Dataset-Aware and Invariant Learning for Face Recognition [67.4903809903022]
To achieve good performance in face recognition, a large scale training dataset is usually required.
Naively combining different datasets is problematic due to two major issues.
First, treating the same person as different classes in different datasets during training will affect back-propagation.
Second, manually cleaning labels may take formidable human effort, especially when there are millions of images and thousands of identities.
arXiv Detail & Related papers (2021-01-14T01:59:52Z) - Tell Me Something I Don't Know: Randomization Strategies for Iterative
Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and
Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z) - Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.