A Novel Intrinsic Measure of Data Separability
- URL: http://arxiv.org/abs/2109.05180v1
- Date: Sat, 11 Sep 2021 04:20:08 GMT
- Title: A Novel Intrinsic Measure of Data Separability
- Authors: Shuyue Guan, Murray Loew
- Abstract summary: In machine learning, the performance of a classifier depends on the separability/complexity of datasets.
We create an intrinsic measure -- the Distance-based Separability Index (DSI).
We show that the DSI can indicate whether the distributions of datasets are identical for any dimensionality.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In machine learning, the performance of a classifier depends on both the
classifier model and the separability/complexity of datasets. To quantitatively
measure the separability of datasets, we create an intrinsic measure -- the
Distance-based Separability Index (DSI), which is independent of the classifier
model. We consider the situation in which different classes of data are mixed
in the same distribution to be the most difficult for classifiers to separate.
We then formally show that the DSI can indicate whether the distributions of
datasets are identical for any dimensionality. We verify that the DSI is an
effective separability measure by comparing it with several state-of-the-art
separability/complexity measures on synthetic and real datasets. Having
demonstrated the DSI's ability to compare distributions of samples, we also
discuss some of its other promising applications, such as measuring the
performance of generative adversarial networks (GANs) and evaluating the
results of clustering methods.
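To make the construction concrete: for each class, the DSI compares the set of intra-class distances (ICD, distances between points of the same class) with the set of between-class distances (BCD, distances from that class to all other points); when the two sets follow the same distribution, the classes are maximally mixed. The sketch below is a minimal NumPy illustration of this idea, using the two-sample Kolmogorov-Smirnov statistic as the distance between the ICD and BCD distributions; function names and the averaging over classes are simplifications, not the paper's reference implementation.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def pairwise_distances(x, y=None):
    """Euclidean distances within x (upper triangle) or between x and y."""
    if y is None:
        d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        iu = np.triu_indices(len(x), k=1)
        return d[iu]
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).ravel()

def dsi(X, labels):
    """Distance-based Separability Index (sketch): mean KS distance between
    each class's intra-class distance (ICD) set and its between-class
    distance (BCD) set. Near 0 when classes share one distribution; near 1
    when they are well separated."""
    scores = []
    for c in np.unique(labels):
        within = X[labels == c]
        rest = X[labels != c]
        icd = pairwise_distances(within)
        bcd = pairwise_distances(within, rest)
        scores.append(ks_statistic(icd, bcd))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
mixed = rng.normal(size=(200, 2))              # two classes drawn from one distribution
separated = np.concatenate([rng.normal(0, 1, (100, 2)),
                            rng.normal(8, 1, (100, 2))])
print(dsi(mixed, labels))        # low: the class distributions are identical
print(dsi(separated, labels))    # high: the classes are well separated
```

This also illustrates why the measure is intrinsic: it depends only on pairwise distances between labeled samples, never on any classifier.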
Related papers
- Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
- Exploring Hierarchical Classification Performance for Time Series Data: Dissimilarity Measures and Classifier Comparisons [0.0]
This study investigates the comparative performance of hierarchical classification (HC) and flat classification (FC) methodologies in time series data analysis.
Dissimilarity measures, including Jensen-Shannon Distance (JSD), Task Similarity Distance (TSD), and Classifier-Based Distance (CBD), are leveraged.
arXiv Detail & Related papers (2024-02-07T21:46:26Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- A classification performance evaluation measure considering data separability [6.751026374812737]
We propose a new separability measure--the rate of separability (RS)--based on the data coding rate.
We demonstrate the positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset.
arXiv Detail & Related papers (2022-11-10T09:18:26Z)
- Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
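The CL procedure described above is simple to state: pseudo-label the representations with K-means, then measure how well a KNN classifier can recover those pseudo-labels. A minimal scikit-learn sketch of that pipeline (the function name, cluster counts, and toy data are illustrative, not the paper's setup):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def cluster_learnability(reps, n_clusters=10, n_neighbors=5, seed=0):
    """Cluster Learnability sketch: pseudo-label representations with
    K-means, then report the cross-validated accuracy of a KNN classifier
    trained to predict those pseudo-labels."""
    pseudo = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(reps)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    return cross_val_score(knn, reps, pseudo, cv=5).mean()

rng = np.random.default_rng(0)
# Toy "representations": three tight clusters vs. unstructured noise.
structured = np.concatenate([rng.normal(c, 0.3, (50, 8)) for c in range(3)])
noise = rng.normal(size=(150, 8))
print(cluster_learnability(structured, n_clusters=3))  # near 1.0: clusters are easy to recover
print(cluster_learnability(noise, n_clusters=3))
```

The design choice is that both steps use only the representations themselves, so CL needs no downstream task labels.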
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Effective Data-aware Covariance Estimator from Compressed Data [63.16042585506435]
We propose a data-aware weighted sampling based covariance matrix estimator, namely DACE, which can provide an unbiased covariance matrix estimation.
We conduct extensive experiments on both synthetic and real-world datasets to demonstrate the superior performance of our DACE.
arXiv Detail & Related papers (2020-10-10T10:10:28Z)
- Data Separability for Neural Network Classifiers and the Development of a Separability Index [17.49709034278995]
We created the Distance-based Separability Index (DSI) to measure the separability of datasets.
We show that the DSI can indicate whether data belonging to different classes have similar distributions.
We also discuss possible applications of the DSI in the fields of data science, machine learning, and deep learning.
arXiv Detail & Related papers (2020-05-27T01:49:19Z)
- Learning Similarity Metrics for Numerical Simulations [29.39625644221578]
We propose a neural network-based approach that computes a stable and generalizing metric (LSiM) to compare data from a variety of numerical simulation sources.
Our method employs a Siamese network architecture that is motivated by the mathematical properties of a metric.
arXiv Detail & Related papers (2020-02-18T20:11:15Z)
- TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions [0.0]
Total cumulative mutual information (TCMI) is a measure of the relevance of mutual dependences.
TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets.
arXiv Detail & Related papers (2020-01-30T08:42:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.