The Exploitation of Distance Distributions for Clustering
- URL: http://arxiv.org/abs/2108.09649v1
- Date: Sun, 22 Aug 2021 06:22:08 GMT
- Title: The Exploitation of Distance Distributions for Clustering
- Authors: Michael C. Thrun
- Abstract summary: In cluster analysis, different properties for distance distributions are judged to be relevant for appropriate distance selection.
By systematically investigating this specification using distribution analysis through a mirrored-density plot, it is shown that multimodal distance distributions are preferable in cluster analysis.
Experiments are performed on several artificial datasets and natural datasets for the task of clustering.
- Score: 3.42658286826597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although distance measures are used in many machine learning algorithms, the
literature on the context-independent selection and evaluation of distance
measures is limited in the sense that prior knowledge is used. In cluster
analysis, current studies evaluate the choice of distance measure after
applying unsupervised methods based on error probabilities, implicitly setting
the goal of reproducing predefined partitions in data. Such studies use
clusters of data that are often based on the context of the data as well as the
custom goal of the specific study. Depending on the data context, different
properties for distance distributions are judged to be relevant for appropriate
distance selection. However, if cluster analysis is based on the task of
finding similar partitions of data, then the intrapartition distances should be
smaller than the interpartition distances. By systematically investigating this
specification using distribution analysis through a mirrored-density plot, it
is shown that multimodal distance distributions are preferable in cluster
analysis. As a consequence, it is advantageous to model distance distributions
with Gaussian mixtures prior to the evaluation phase of unsupervised methods.
Experiments are performed on several artificial datasets and natural datasets
for the task of clustering.
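The abstract's recommendation, checking whether the distance distribution is multimodal by fitting Gaussian mixtures before evaluating unsupervised methods, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two-component mixture, the BIC comparison, and the synthetic two-cluster data are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data with two well-separated clusters (assumed example data).
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

# Pairwise distances: intrapartition distances are small, interpartition
# distances are large, so their distribution should be multimodal.
d = pdist(X).reshape(-1, 1)

# Fit 1- and 2-component Gaussian mixtures and compare via BIC:
# a lower BIC for k=2 indicates a multimodal distance distribution.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(d).bic(d)
       for k in (1, 2)}
```

A lower BIC for the two-component mixture signals the kind of multimodal distance distribution the paper argues is preferable for clustering.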
Related papers
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
- A Distribution-Based Threshold for Determining Sentence Similarity [0.0]
We present a solution to a semantic textual similarity (STS) problem in which it is necessary to match two sentences containing, as the only distinguishing factor, highly specific information.
The solution revolves around a Siamese neural network that creates the distributions of the distances between similar and dissimilar sentence pairs.
arXiv Detail & Related papers (2023-11-28T10:42:35Z)
- Computing the Distance between unbalanced Distributions -- The flat Metric [0.0]
The flat metric generalizes the well-known Wasserstein distance W1 to the case that the distributions are of unequal total mass.
The core of the method is a neural network that determines an optimal test function realizing the distance between two measures.
arXiv Detail & Related papers (2023-08-02T09:30:22Z)
- Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z)
- Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [68.62134204367668]
This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace.
We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated.
The generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution.
arXiv Detail & Related papers (2023-02-14T17:02:35Z)
- A new nonparametric interpoint distance-based measure for assessment of clustering [0.0]
A new interpoint distance-based measure is proposed to identify the optimal number of clusters present in a data set.
Our proposed criterion is compatible with any clustering algorithm, and can be used to determine the unknown number of clusters.
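The paper's specific criterion is not given in this summary; as a generic stand-in, the silhouette score (also an interpoint-distance-based measure) illustrates how such a criterion can select the number of clusters on top of any clustering algorithm. The k-means backend and the synthetic three-cluster data are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Three well-separated synthetic clusters (assumed example data).
X = np.vstack([rng.normal(c, 0.5, (60, 2)) for c in (0, 5, 10)])

# Score candidate cluster counts with an interpoint-distance criterion;
# silhouette stands in here for the paper's (unspecified) measure.
scores = {k: silhouette_score(X, KMeans(k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

The candidate count with the highest score is taken as the estimated number of clusters.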
arXiv Detail & Related papers (2022-10-01T04:27:54Z)
- Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z)
- Kernel distance measures for time series, random fields and other structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
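The lower-quantile idea can be illustrated with a toy distance between two samples. The Euclidean base distance, the 10% quantile, and the self-similarity correction below are assumptions for the sketch, not kdiff's actual definition.

```python
import numpy as np

def quantile_distance(A, B, q=0.1):
    """Toy kdiff-style measure (assumed form): the q-quantile of the
    cross-distances, corrected by the q-quantiles of the self-distances."""
    cross = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    self_a = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
    self_b = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=-1)
    return (np.quantile(cross, q)
            - 0.5 * (np.quantile(self_a, q) + np.quantile(self_b, q)))

rng = np.random.default_rng(1)
# Two samples from the same distribution vs. two from distant distributions.
near = quantile_distance(rng.normal(0, 1, (50, 3)), rng.normal(0, 1, (50, 3)))
far = quantile_distance(rng.normal(0, 1, (50, 3)), rng.normal(10, 1, (50, 3)))
```

Using a lower quantile instead of, say, the mean makes the measure less sensitive to outlying pairs, which is the property the summary highlights.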
arXiv Detail & Related papers (2021-09-29T22:54:17Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
- On Cokriging, Neural Networks, and Spatial Blind Source Separation for Multivariate Spatial Prediction [3.416170716497814]
In this paper we investigate the use of spatial blind source separation as a pre-processing tool for spatial prediction.
We compare it with predictions from Cokriging and neural networks in an extensive simulation study as well as a geochemical dataset.
arXiv Detail & Related papers (2020-07-01T10:59:45Z)
- Adaptive Discrete Smoothing for High-Dimensional and Nonlinear Panel Data [4.550919471480445]
We develop a data-driven smoothing technique for high-dimensional and non-linear panel data models.
The weights are determined in a data-driven way and depend on the similarity between the corresponding functions.
We conduct a simulation study which shows that the prediction can be greatly improved by using our estimator.
arXiv Detail & Related papers (2019-12-30T09:50:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.