Distances with mixed type variables some modified Gower's coefficients
- URL: http://arxiv.org/abs/2101.02481v1
- Date: Thu, 7 Jan 2021 11:00:57 GMT
- Title: Distances with mixed type variables some modified Gower's coefficients
- Authors: Marcello D'Orazio
- Abstract summary: The choice of the distance function depends mainly on the type of the selected variables.
The most popular distance for mixed type variables is derived as the complement of Gower's similarity coefficient.
This article tries to address the main drawbacks that affect the overall unweighted Gower's distance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Nearest neighbor methods have become popular in official statistics, mainly
in imputation or in statistical matching problems; they play a key role in
machine learning too, where a high number of variants have been proposed. The
choice of the distance function depends mainly on the type of the selected
variables. Unfortunately, relatively few options can handle mixed type
variables, a situation frequently encountered in official statistics. The most
popular distance for mixed type variables is derived as the complement of
Gower's similarity coefficient; it is appealing because it ranges between 0 and 1
and can handle missing values. Unfortunately, in the unweighted standard
setting the contribution of the single variables to the overall Gower's
distance is unbalanced because of the different nature of the variables
themselves. This article tries to address the main drawbacks that affect the
overall unweighted Gower's distance by suggesting some modifications in
calculating the distance on the interval and ratio scaled variables. Simple
modifications try to attenuate the impact of outliers on the scaled Manhattan
distance; other modifications, relying on kernel density estimation methods,
attempt to reduce the unbalanced contribution of the different types of
variables. The performance of the proposals is evaluated in simulations
mimicking the imputation of missing values through the nearest neighbor distance
hotdeck method.
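As a rough illustration of the standard unweighted Gower's distance that the abstract starts from (not the modified coefficients proposed in the paper), the sketch below uses the range-scaled absolute difference for numeric variables, a 0/1 mismatch for categorical variables, and skips any variable that is missing in either record. The function names `gower_distance` and `nearest_donor`, the toy records, and the ranges are assumptions made for this example only.

```python
def gower_distance(a, b, kinds, ranges):
    """Unweighted Gower distance between two records (illustrative sketch).

    a, b   : sequences of values; None marks a missing value
    kinds  : "num" or "cat" for each variable
    ranges : observed range (max - min) of each numeric variable;
             the entry is ignored for categorical variables
    """
    total = 0.0
    used = 0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # a variable missing in either record is skipped
        if kind == "num":
            # scaled Manhattan contribution, in [0, 1] when the range is the observed one
            d = abs(x - y) / rng if rng > 0 else 0.0
        else:
            # categorical contribution: 0 if equal, 1 otherwise
            d = 0.0 if x == y else 1.0
        total += d
        used += 1
    if used == 0:
        raise ValueError("no variable is observed in both records")
    return total / used


def nearest_donor(recipient, donors, kinds, ranges):
    """Index of the donor closest to the recipient (first one wins ties)."""
    dists = [gower_distance(recipient, d, kinds, ranges) for d in donors]
    return min(range(len(donors)), key=lambda i: dists[i])


# Hypothetical toy data: (age, income, region)
kinds = ["num", "num", "cat"]
ranges = [50.0, 40000.0, None]

rec_a = (30, 20000, "A")
rec_b = (40, 30000, "B")
d_ab = gower_distance(rec_a, rec_b, kinds, ranges)  # (0.2 + 0.25 + 1.0) / 3

# Nearest neighbor distance hot deck step: the recipient lacks income,
# so the closest donor on the remaining variables would supply it.
recipient = (35, None, "A")
donors = [(30, 20000, "A"), (80, 60000, "B")]
donor_idx = nearest_donor(recipient, donors, kinds, ranges)
imputed_income = donors[donor_idx][1]
```

In the toy pair above the categorical mismatch contributes a full 1 while the two scaled numeric differences contribute 0.2 and 0.25, which gives a feel for the unbalanced contributions the abstract refers to.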
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Model-independent variable selection via the rule-based variable priority [1.2771542695459488]
We introduce a new model-independent approach, Variable Priority (VarPro).
VarPro works by utilizing rules without the need to generate artificial data or evaluate prediction error.
We show that VarPro has a consistent filtering property for noise variables.
arXiv Detail & Related papers (2024-09-13T17:32:05Z)
- Multivariate root-n-consistent smoothing parameter free matching estimators and estimators of inverse density weighted expectations [51.000851088730684]
We develop novel modifications of nearest-neighbor and matching estimators which converge at the parametric $\sqrt{n}$-rate.
We stress that our estimators do not involve nonparametric function estimators and in particular do not rely on sample-size dependent smoothing parameters.
arXiv Detail & Related papers (2024-07-11T13:28:34Z)
- Gower's similarity coefficients with automatic weight selection [0.0]
The most popular dissimilarity for mixed-type variables is derived as the complement to one of Gower's similarity coefficient.
The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the single variables to the overall dissimilarity.
We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity.
arXiv Detail & Related papers (2024-01-30T14:21:56Z)
- Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation [14.993705256147189]
Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the former approach that does not treat categorical variables as numeric.
arXiv Detail & Related papers (2023-10-17T10:29:23Z)
- Confidence-Based Model Selection: When to Take Shortcuts for Subpopulation Shifts [119.22672589020394]
We propose COnfidence-baSed MOdel Selection (CosMoS), where model confidence can effectively guide model selection.
We evaluate CosMoS on four datasets with spurious correlations, each with multiple test sets with varying levels of data distribution shift.
arXiv Detail & Related papers (2023-06-19T18:48:15Z)
- Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data [1.6099403809839032]
The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm.
It provides balance between accurate prediction and efficient interpretation.
Code is provided as an open-source package in R.
arXiv Detail & Related papers (2023-01-17T21:50:35Z)
- VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning [84.70916463298109]
VarCLR is a new approach for learning semantic representations of variable names.
VarCLR is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs.
We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT.
arXiv Detail & Related papers (2021-12-05T18:40:32Z)
- Double Control Variates for Gradient Estimation in Discrete Latent Variable Models [32.33171301923846]
We introduce a variance reduction technique for score function estimators.
We show that our estimator can have lower variance compared to other state-of-the-art estimators.
arXiv Detail & Related papers (2021-11-09T18:02:42Z)
- Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator [93.05919133288161]
We show that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can be reduced through Rao-Blackwellization.
This provably reduces the mean squared error.
We empirically demonstrate that this leads to variance reduction, faster convergence, and generally improved performance in two unsupervised latent variable models.
arXiv Detail & Related papers (2020-10-09T22:54:38Z)
- SUMO: Unbiased Estimation of Log Marginal Probability for Latent Variable Models [80.22609163316459]
We introduce an unbiased estimator of the log marginal likelihood and its gradients for latent variable models based on randomized truncation of infinite series.
We show that models trained using our estimator give better test-set likelihoods than a standard importance-sampling based approach for the same average computational cost.
arXiv Detail & Related papers (2020-04-01T11:49:30Z)