Gower's similarity coefficients with automatic weight selection
- URL: http://arxiv.org/abs/2401.17041v1
- Date: Tue, 30 Jan 2024 14:21:56 GMT
- Title: Gower's similarity coefficients with automatic weight selection
- Authors: Marcello D'Orazio
- Abstract summary: The most popular dissimilarity for mixed-type variables is derived as the complement to one of Gower's similarity coefficient.
The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the individual variables to the overall dissimilarity.
We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Nearest-neighbor methods have become popular in statistics and play a key
role in statistical learning. Important decisions in nearest-neighbor methods
concern the variables to use (when many potential candidates exist) and how to
measure the dissimilarity between units. The first decision depends on the scope of the application, while the second depends mainly on the type of variables. Unfortunately, relatively few options can handle mixed-type variables, a situation frequently encountered in practical applications. The most popular dissimilarity for mixed-type variables is derived as the complement to one of Gower's similarity coefficient. It is appealing because it ranges between 0 and 1, is an average of the scaled dissimilarities calculated variable by variable, handles missing values, and allows for a user-defined weighting scheme when averaging dissimilarities. The discussion on the weighting schemes is
sometimes misleading since it often ignores that the unweighted "standard"
setting hides an unbalanced contribution of the individual variables to the overall
dissimilarity. We address this drawback following the recent idea of
introducing a weighting scheme that minimizes the differences in the
correlation between each contributing dissimilarity and the resulting weighted
Gower's dissimilarity. In particular, this note proposes different approaches
for measuring the correlation depending on the type of variables. The performance of the proposed approaches is evaluated in simulation studies
related to classification and imputation of missing values.
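As a rough illustration of the two ideas summarized above, the sketch below (not the paper's code; the function names, the range scaling for numeric variables, and the multiplicative update rule are assumptions made only for illustration) computes a weighted Gower dissimilarity as an average of per-variable scaled dissimilarities and then searches for weights that equalize the correlation between each variable's contribution and the overall dissimilarity.

```python
# Minimal sketch (not the paper's implementation) of the weighted Gower
# dissimilarity for mixed-type data and of a correlation-balancing weight
# search. Numeric columns use range-scaled absolute differences; all other
# columns use simple matching. Missing-value handling is omitted.
import numpy as np
import pandas as pd

def gower_contributions(df: pd.DataFrame) -> np.ndarray:
    """Return the p per-variable dissimilarity matrices (shape p x n x n)."""
    n, p = df.shape
    num_cols = set(df.select_dtypes(include="number").columns)
    contrib = np.zeros((p, n, n))
    for k, col in enumerate(df.columns):
        x = df[col].to_numpy()
        if col in num_cols:
            rng = np.nanmax(x) - np.nanmin(x)  # range scaling, as in Gower-type coefficients
            contrib[k] = np.abs(x[:, None] - x[None, :]) / (rng if rng > 0 else 1.0)
        else:
            contrib[k] = (x[:, None] != x[None, :]).astype(float)  # simple matching
    return contrib

def weighted_gower(contrib: np.ndarray, w) -> np.ndarray:
    """Weighted average of the per-variable dissimilarities."""
    w = np.asarray(w, dtype=float)
    return np.tensordot(w, contrib, axes=1) / w.sum()

def balance_weights(contrib: np.ndarray, n_iter: int = 200, lr: float = 0.1) -> np.ndarray:
    """Illustrative search for weights that equalize the correlations between
    each per-variable dissimilarity and the overall weighted dissimilarity
    (the balancing idea mentioned in the abstract; the actual estimation
    procedure in the paper may differ)."""
    p, n, _ = contrib.shape
    iu = np.triu_indices(n, k=1)
    D = contrib[:, iu[0], iu[1]]                 # p vectors of pairwise dissimilarities
    w = np.ones(p)
    for _ in range(n_iter):
        g = (w @ D) / w.sum()                    # overall dissimilarity (upper triangle)
        cors = np.array([np.corrcoef(D[k], g)[0, 1] for k in range(p)])
        w *= np.exp(-lr * (cors - cors.mean()))  # upweight under-correlated variables
        w = np.clip(w, 1e-6, None)
    return w / w.sum()
```

In this sketch, weights obtained from balance_weights(gower_contributions(df)) can be passed to weighted_gower so that no single variable dominates the resulting dissimilarity in a nearest-neighbor search; with equal weights the same functions reproduce the unweighted "standard" setting discussed above.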
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z) - Fractional Naive Bayes (FNB): non-convex optimization for a parsimonious weighted selective naive Bayes classifier [0.0]
We study supervised classification for datasets with a very large number of input variables.
We propose a regularization of the model log-likelihood.
The various proposed algorithms result in an optimization-based weighted naive Bayes scheme.
arXiv Detail & Related papers (2024-09-17T11:54:14Z) - Model-independent variable selection via the rule-based variable priority [1.2771542695459488]
We introduce a new model-independent approach, Variable Priority (VarPro).
VarPro works by utilizing rules without the need to generate artificial data or evaluate prediction error.
We show that VarPro has a consistent filtering property for noise variables.
arXiv Detail & Related papers (2024-09-13T17:32:05Z) - Multivariate root-n-consistent smoothing parameter free matching estimators and estimators of inverse density weighted expectations [51.000851088730684]
We develop novel modifications of nearest-neighbor and matching estimators which converge at the parametric $\sqrt{n}$-rate.
We stress that our estimators do not involve nonparametric function estimators and in particular do not rely on sample-size-dependent smoothing parameters.
arXiv Detail & Related papers (2024-07-11T13:28:34Z) - Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation [14.993705256147189]
Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the former approach that does not treat categorical variables as numeric.
arXiv Detail & Related papers (2023-10-17T10:29:23Z) - Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z) - Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones? [62.997667081978825]
Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods.
We compare these methods in extensive simulations to help answer the primary question of when to use multivariate ensemble techniques.
arXiv Detail & Related papers (2022-01-14T08:44:25Z) - On the Use of Minimum Penalties in Statistical Learning [2.1320960069210475]
We propose a framework to simultaneously estimate regression coefficients associated with a multivariate regression model and relationships between outcome variables.
An iterative algorithm that generalizes current state-of-the-art methods is proposed as a solution.
We extend the proposed MinPen framework to other exponential family loss functions, with a specific focus on multiple binomial responses.
arXiv Detail & Related papers (2021-06-09T16:15:46Z) - Distances with mixed type variables some modified Gower's coefficients [0.0]
The choice of the distance function depends mainly on the type of the selected variables.
The most popular distance for mixed-type variables is derived as the complement of Gower's similarity coefficient.
This article tries to address the main drawbacks that affect the overall unweighted Gower's distance.
arXiv Detail & Related papers (2021-01-07T11:00:57Z) - A One-step Approach to Covariate Shift Adaptation [82.01909503235385]
A default assumption in many machine learning scenarios is that the training and test samples are drawn from the same probability distribution.
We propose a novel one-step approach that jointly learns the predictive model and the associated weights in one optimization.
arXiv Detail & Related papers (2020-07-08T11:35:47Z) - Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.