A general framework for implementing distances for categorical variables
- URL: http://arxiv.org/abs/2301.02190v1
- Date: Wed, 4 Jan 2023 13:50:08 GMT
- Title: A general framework for implementing distances for categorical variables
- Authors: Michel van de Velden and Alfonso Iodice D'Enza and Angelos Markos and
Carlo Cavicchia
- Abstract summary: We introduce a general framework that allows for an efficient and transparent implementation of distances between observations on categorical variables.
Our framework quite naturally leads to the introduction of new distance formulations and allows for the implementation of flexible, case- and data-specific distance definitions.
In a supervised classification setting, the framework can be used to construct distances that incorporate the association between the response and predictor variables.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The degree to which subjects differ from each other with respect to certain
properties measured by a set of variables plays an important role in many
statistical methods. For example, classification, clustering, and data
visualization methods all require a quantification of differences in the
observed values. We refer to the quantification of such differences as
distance. An appropriate definition of a distance depends on the nature of the
data and the problem at hand. For numerical variables, many distance
definitions exist that depend on the size of the observed differences. For
categorical data, defining a distance is more complex, as there is no
straightforward quantification of the size of the observed differences.
Consequently, many proposals exist that can be used to measure differences
based on categorical variables. In this paper, we introduce a general framework
that allows for an efficient and transparent implementation of distances
between observations on categorical variables. We show that several existing
distances can be incorporated into the framework. Moreover, our framework quite
naturally leads to the introduction of new distance formulations and allows for
the implementation of flexible, case- and data-specific distance definitions.
Furthermore, in a supervised classification setting, the framework can be used
to construct distances that incorporate the association between the response
and predictor variables and hence improve the performance of distance-based
classifiers.
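The framework described above can be pictured as combining per-variable category dissimilarities into one overall distance. The sketch below is a minimal illustration of that idea, not the paper's actual formulation: each categorical variable gets a hypothetical dissimilarity matrix `deltas[v]`, the distance between two observations is the sum of the per-variable entries, and the classic simple matching distance falls out as the special case where every matrix is one minus the identity.

```python
import numpy as np

def categorical_distance(a, b, deltas):
    """Distance between two observations on categorical variables.

    a, b   : sequences of category indices, one per variable
    deltas : list of per-variable dissimilarity matrices; deltas[v][p, q]
             is the cost assigned to categories p and q differing on variable v
    """
    return sum(deltas[v][a[v], b[v]] for v in range(len(deltas)))

def simple_matching_deltas(n_categories):
    """Simple matching distance: every mismatch costs 1, every match costs 0."""
    return [1.0 - np.eye(k) for k in n_categories]

deltas = simple_matching_deltas([3, 2, 4])
x = [0, 1, 2]
y = [0, 0, 3]
print(categorical_distance(x, y, deltas))  # → 2.0 (mismatch on variables 2 and 3)
```

Swapping in other `deltas` matrices (e.g., ones derived from category frequencies or from an association with a response variable) changes the distance definition without changing the combination step, which is the kind of flexibility the abstract alludes to.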
Related papers
- Graph-based Virtual Sensing from Sparse and Partial Multivariate
Observations [22.567497617912046]
We introduce a novel graph-based methodology to exploit such relationships and design a graph deep learning architecture, named GgNet, implementing the framework.
The proposed approach relies on propagating information over a nested graph structure that is used to learn dependencies between variables as well as locations.
GgNet is extensively evaluated under different virtual sensing scenarios, demonstrating higher reconstruction accuracy compared to the state-of-the-art.
arXiv Detail & Related papers (2024-02-19T23:22:30Z)
- Gower's similarity coefficients with automatic weight selection [0.0]
The most popular dissimilarity for mixed-type variables is derived as one minus Gower's similarity coefficient.
The discussion on the weighting schemes is sometimes misleading since it often ignores that the unweighted "standard" setting hides an unbalanced contribution of the single variables to the overall dissimilarity.
We address this drawback following the recent idea of introducing a weighting scheme that minimizes the differences in the correlation between each contributing dissimilarity and the resulting weighted Gower's dissimilarity.
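As background for the weighting discussion above, a minimal sketch of the standard (unweighted-by-default) Gower dissimilarity may help; the function below is a simplified illustration, not the paper's weight-selection method. Numeric variables contribute a range-normalized absolute difference, categorical variables a 0/1 mismatch, and a `weights` vector controls each variable's contribution.

```python
import numpy as np

def gower_dissimilarity(x, y, is_numeric, ranges, weights=None):
    """Weighted Gower dissimilarity between two mixed-type observations.

    Numeric variables contribute |x_v - y_v| / range_v; categorical ones
    contribute a 0/1 mismatch. Weights default to equal ("standard" Gower).
    """
    p = len(x)
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    contrib = np.empty(p)
    for v in range(p):
        if is_numeric[v]:
            contrib[v] = abs(x[v] - y[v]) / ranges[v]
        else:
            contrib[v] = 0.0 if x[v] == y[v] else 1.0
    return float(np.dot(w, contrib) / w.sum())

# One numeric variable (range 10) and two categorical ones:
d = gower_dissimilarity([2.0, "a", "x"], [7.0, "a", "y"],
                        is_numeric=[True, False, False],
                        ranges=[10.0, None, None])
print(d)  # → 0.5, i.e. (0.5 + 0.0 + 1.0) / 3
```

The unbalanced contribution criticized above is visible here: a 0/1 categorical mismatch and a range-normalized numeric difference live on scales with very different variability, so equal weights do not imply equal influence.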
arXiv Detail & Related papers (2024-01-30T14:21:56Z)
- Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation [14.993705256147189]
Conditional independence testing (CIT) is a common task in machine learning.
Many real-world applications involve mixed-type datasets that include numerical and categorical variables.
We propose a variation of the former approach that does not treat categorical variables as numeric.
arXiv Detail & Related papers (2023-10-17T10:29:23Z)
- Enriching Disentanglement: From Logical Definitions to Quantitative Metrics [59.12308034729482]
Disentangling the explanatory factors in complex data is a promising approach for data-efficient representation learning.
We establish relationships between logical definitions and quantitative metrics to derive theoretically grounded disentanglement metrics.
We empirically demonstrate the effectiveness of the proposed metrics by isolating different aspects of disentangled representations.
arXiv Detail & Related papers (2023-05-19T08:22:23Z)
- Kernel distance measures for time series, random fields and other structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
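The use of a lower quantile of the distance distribution, as described above, can be sketched in a few lines. The function below is a loose, hypothetical simplification for sets of vectors, not kdiff's actual definition: it takes the q-quantile of all pairwise cross distances between two instances and offsets it by the corresponding self-distance quantiles, so identical instances score zero.

```python
import numpy as np

def quantile_distance(A, B, q=0.1):
    """Quantile-based distance between two sets of row vectors (illustrative
    simplification of a kdiff-style measure, not the published definition).

    Uses the q-quantile of the cross-distance distribution, adjusted by the
    within-set (self) distance quantiles so that quantile_distance(A, A) == 0.
    """
    cross = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1).ravel()
    self_a = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1).ravel()
    self_b = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=-1).ravel()
    return float(np.quantile(cross, q)
                 - 0.5 * (np.quantile(self_a, q) + np.quantile(self_b, q)))
```

A low quantile makes the measure focus on the closest matching pairs rather than the average discrepancy, which is what makes such measures robust to outlying elements within an instance.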
arXiv Detail & Related papers (2021-09-29T22:54:17Z)
- Disentanglement Analysis with Partial Information Decomposition [31.56299813238937]
Disentangled representations aim to reverse the generative process by mapping data to multiple random variables that individually capture distinct generative factors.
Current disentanglement metrics measure the concentration, e.g., absolute deviation, variance, or entropy, of each variable conditioned on each generative factor.
In this work, we use the Partial Information Decomposition framework to evaluate information sharing between more than two variables, and build a framework, including a new disentanglement metric.
arXiv Detail & Related papers (2021-08-31T11:09:40Z)
- Ranking the information content of distance measures [61.754016309475745]
We introduce a statistical test that can assess the relative information retained when using two different distance measures.
This in turn allows finding the most informative distance measure out of a pool of candidates.
arXiv Detail & Related papers (2021-04-30T15:57:57Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
- Learning Disentangled Representations with Latent Variation Predictability [102.4163768995288]
This paper defines the variation predictability of latent disentangled representations.
Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs.
We develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations.
arXiv Detail & Related papers (2020-07-25T08:54:26Z)
- Neural Methods for Point-wise Dependency Estimation [129.93860669802046]
We focus on estimating point-wise dependency (PD), which quantitatively measures how likely two outcomes co-occur.
We demonstrate the effectiveness of our approaches in 1) MI estimation, 2) self-supervised representation learning, and 3) cross-modal retrieval task.
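Point-wise dependency, as summarized above, is the ratio PD(x, y) = p(x, y) / (p(x) p(y)): values above 1 mean the outcomes co-occur more often than independence would predict. The paper estimates it with neural networks; the sketch below only computes the empirical plug-in version from observed pairs, as a small illustration of the quantity itself.

```python
from collections import Counter

def pointwise_dependency(pairs, x, y):
    """Empirical point-wise dependency PD(x, y) = p(x, y) / (p(x) * p(y)),
    estimated from a list of observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (x, y) pairs
    px = Counter(a for a, _ in pairs)       # marginal counts of x
    py = Counter(b for _, b in pairs)       # marginal counts of y
    return (joint[(x, y)] / n) / ((px[x] / n) * (py[y] / n))

pairs = [("a", 1), ("a", 1), ("b", 2), ("b", 1)]
print(pointwise_dependency(pairs, "a", 1))  # → 1.333..., i.e. 0.5 / (0.5 * 0.75)
```

Averaging log PD over the joint distribution recovers mutual information, which is why PD estimation connects directly to the MI-estimation experiments mentioned in the summary.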
arXiv Detail & Related papers (2020-06-09T23:26:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.