Ranking the information content of distance measures
- URL: http://arxiv.org/abs/2104.15079v1
- Date: Fri, 30 Apr 2021 15:57:57 GMT
- Title: Ranking the information content of distance measures
- Authors: Aldo Glielmo, Claudio Zeni, Bingqing Cheng, Gabor Csanyi, Alessandro
Laio
- Abstract summary: We introduce a statistical test that can assess the relative information retained when using two different distance measures.
This in turn allows finding the most informative distance measure out of a pool of candidates.
- Score: 61.754016309475745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world data typically contain a large number of features that are often
heterogeneous in nature, relevance, and also units of measure. When assessing
the similarity between data points, one can build various distance measures
using subsets of these features. Using the fewest features but still retaining
sufficient information about the system is crucial in many statistical learning
approaches, particularly when data are sparse. We introduce a statistical test
that can assess the relative information retained when using two different
distance measures, and determine if they are equivalent, independent, or if one
is more informative than the other. This in turn allows finding the most
informative distance measure out of a pool of candidates. The approach is
applied to find the most relevant policy variables for controlling the Covid-19
epidemic and to find compact yet informative representations of atomic
structures, but its potential applications are wide ranging in many branches of
science.
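The test described in the abstract rests on a neighbour-rank idea: if distance measure A is informative about distance measure B, the nearest neighbour of each point under A should still have a low rank under B. A minimal sketch of this kind of estimator, assuming Euclidean distances in two feature subspaces (the function name and details are illustrative, not the authors' exact implementation):

```python
import numpy as np

def information_imbalance(X_a, X_b):
    """Estimate how much of the neighbourhood information carried by
    distance space A is retained in distance space B.

    Illustrative sketch: returns roughly 0 when B preserves A's nearest
    neighbours and roughly 1 when the two spaces are unrelated.
    """
    n = len(X_a)
    d_a = np.linalg.norm(X_a[:, None] - X_a[None, :], axis=-1)
    d_b = np.linalg.norm(X_b[:, None] - X_b[None, :], axis=-1)
    np.fill_diagonal(d_a, np.inf)  # exclude self-distances
    np.fill_diagonal(d_b, np.inf)
    nn_a = d_a.argmin(axis=1)                      # nearest neighbour under A
    ranks_b = d_b.argsort(axis=1).argsort(axis=1)  # 0-based ranks under B
    # 1-based rank in B of each point's nearest neighbour in A
    r = ranks_b[np.arange(n), nn_a] + 1
    return 2.0 * r.mean() / n
```

Comparing the statistic in both directions (A to B and B to A) is what lets one classify two distance measures as equivalent, independent, or asymmetrically informative.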
Related papers
- Conformal Disentanglement: A Neural Framework for Perspective Synthesis and Differentiation [0.8192907805418583]
We make observations of objects from several different perspectives in space, at different points in time.
It is necessary to synthesize a complete picture of what is "common" across these sources.

We introduce a neural network autoencoder framework capable of both tasks.
arXiv Detail & Related papers (2024-08-27T18:06:45Z)
- A Practical Guide to Sample-based Statistical Distances for Evaluating Generative Models in Science [7.2447605934304375]
We focus on four commonly used notions of statistical distances representing different methodologies.
We highlight the intuition behind each distance and explain their merits, scalability, complexity, and pitfalls.
We evaluate generative models from different scientific domains, namely a model of decision-making and a model generating medical images.
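One of the commonly used sample-based distances in this setting is the maximum mean discrepancy (MMD). A minimal sketch of a biased squared-MMD estimate with an RBF kernel (illustrative only; the surveyed guide also covers Wasserstein-type and classifier-based distances):

```python
import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy between
    two samples X and Y, using an RBF kernel with bandwidth sigma."""
    def k(A, B):
        # pairwise squared distances, then Gaussian kernel
        d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

The statistic is near zero when the two samples come from the same distribution and grows as the distributions separate.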
arXiv Detail & Related papers (2024-03-19T11:16:14Z)
- Estimation of mutual information via quantum kernel method [0.0]
Estimating mutual information (MI) plays a critical role in investigating relationships among multiple random variables with nonlinear correlations.
We propose a method for estimating mutual information using the quantum kernel.
arXiv Detail & Related papers (2023-10-19T00:53:16Z)
- DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z)
- A general framework for implementing distances for categorical variables [0.0]
We introduce a general framework that allows for an efficient and transparent implementation of distances between observations on categorical variables.
Our framework quite naturally leads to the introduction of new distance formulations and allows for the implementation of flexible, case and data specific distance definitions.
In a supervised classification setting, the framework can be used to construct distances that incorporate the association between the response and predictor variables.
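A framework of this kind can be sketched as a sum of per-variable dissimilarities, with each variable free to use its own definition. The names below are hypothetical placeholders, not the paper's API; simple matching (0 if equal, 1 otherwise) is just one possible per-variable choice:

```python
def categorical_distance(x, y, var_dissimilarities):
    """Distance between two observations on categorical variables,
    built as a sum of per-variable dissimilarity functions.

    Illustrative sketch: each element of var_dissimilarities is a
    function d(a, b) chosen per variable, which is what makes the
    framework flexible and case-specific.
    """
    return sum(d(a, b) for d, a, b in zip(var_dissimilarities, x, y))

def simple_matching(a, b):
    """0 if the two categories agree, 1 otherwise."""
    return 0.0 if a == b else 1.0
```

For example, two observations agreeing on one of two variables get distance 1.0 under simple matching; supervised variants would replace `simple_matching` with dissimilarities informed by the response variable.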
arXiv Detail & Related papers (2023-01-04T13:50:08Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
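The simplest oversampling strategy is to duplicate randomly chosen minority-class examples until the classes are balanced. A minimal sketch (libraries such as imbalanced-learn provide more refined strategies like SMOTE):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Balance a dataset by duplicating randomly chosen examples of
    each under-represented class until all classes match the majority
    class count. Returns new lists; the originals are not modified."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)  # resample a minority example
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out
```

Undersampling works in the opposite direction, discarding majority-class examples; both trade off information loss against class balance.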
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Gaussianizing the Earth: Multidimensional Information Measures for Earth Data Analysis [9.464720193746395]
Information theory is an excellent framework for analyzing Earth system data.
It allows us to characterize uncertainty and redundancy, and is universally interpretable.
We show how information theory measures can be applied in various Earth system data analysis problems.
arXiv Detail & Related papers (2020-10-13T15:30:34Z)
- Multi-Task Incremental Learning for Object Detection [71.57155077119839]
Multi-task learning handles multiple tasks jointly, sharing knowledge and computation among them.
It suffers from catastrophic forgetting of previous knowledge when tasks are learned incrementally without access to the old data.
arXiv Detail & Related papers (2020-02-13T04:58:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.