Wasserstein t-SNE
- URL: http://arxiv.org/abs/2205.07531v1
- Date: Mon, 16 May 2022 09:09:24 GMT
- Title: Wasserstein t-SNE
- Authors: Fynn Bachmann, Philipp Hennig, Dmitry Kobak
- Abstract summary: We develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric.
We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them.
- Score: 25.241296604908424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scientific datasets often have hierarchical structure: for example, in
surveys, individual participants (samples) might be grouped at a higher level
(units) such as their geographical region. In these settings, the interest is
often in exploring the structure on the unit level rather than on the sample
level. Units can be compared based on the distance between their means, however
this ignores the within-unit distribution of samples. Here we develop an
approach for exploratory analysis of hierarchical datasets using the
Wasserstein distance metric that takes into account the shapes of within-unit
distributions. We use t-SNE to construct 2D embeddings of the units, based on
the matrix of pairwise Wasserstein distances between them. The distance matrix
can be efficiently computed by approximating each unit with a Gaussian
distribution, but we also provide a scalable method to compute exact
Wasserstein distances. We use synthetic data to demonstrate the effectiveness
of our Wasserstein t-SNE, and apply it to data from the 2017 German
parliamentary election, considering polling stations as samples and voting
districts as units. The resulting embedding uncovers meaningful structure in
the data.
Related papers
- Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport [11.94799054956877]
We present a decomposition of Fr'echet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures.
We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained, and the classification accuracy of machine learning classifiers built on the embedded data.
arXiv Detail & Related papers (2024-11-15T14:10:52Z) - Federated Wasserstein Distance [16.892296712204597]
We introduce a principled way of computing the Wasserstein distance between two distributions in a federated manner.
We show how to estimate the Wasserstein distance between two samples stored and kept on different devices/clients whilst a central entity/server orchestrates the computations.
arXiv Detail & Related papers (2023-10-03T11:30:50Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Concrete Score Matching: Generalized Score Matching for Discrete Data [109.12439278055213]
"Concrete score" is a generalization of the (Stein) score for discrete settings.
"Concrete Score Matching" is a framework to learn such scores from samples.
arXiv Detail & Related papers (2022-11-02T00:41:37Z) - Density Ratio Estimation via Infinitesimal Classification [85.08255198145304]
We propose DRE-infty, a divide-and-conquer approach to reduce Density ratio estimation (DRE) to a series of easier subproblems.
Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions.
We show that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
arXiv Detail & Related papers (2021-11-22T06:26:29Z) - On the use of Wasserstein metric in topological clustering of
distributional data [0.0]
This paper deals with a clustering algorithm for histogram data based on a Self-Organizing Map (SOM) learning.
It combines a dimension reduction by SOM and the clustering of the data in a reduced space.
arXiv Detail & Related papers (2021-09-09T14:27:15Z) - Depth-based pseudo-metrics between probability distributions [1.1470070927586016]
We propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions.
In contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality.
The regions-based pseudo-metric appears to be robust w.r.t. both outliers and heavy tails.
arXiv Detail & Related papers (2021-03-23T17:33:18Z) - Learning High Dimensional Wasserstein Geodesics [55.086626708837635]
We propose a new formulation and learning strategy for computing the Wasserstein geodesic between two probability distributions in high dimensions.
By applying the method of Lagrange multipliers to the dynamic formulation of the optimal transport (OT) problem, we derive a minimax problem whose saddle point is the Wasserstein geodesic.
We then parametrize the functions by deep neural networks and design a sample based bidirectional learning algorithm for training.
arXiv Detail & Related papers (2021-02-05T04:25:28Z) - Two-sample Test using Projected Wasserstein Distance [18.46110328123008]
We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning.
A key contribution is to couple optimal projection to find the low dimensional linear mapping to maximize the Wasserstein distance between projected probability distributions.
arXiv Detail & Related papers (2020-10-22T18:08:58Z) - Augmented Sliced Wasserstein Distances [55.028065567756066]
We propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs)
ASWDs are constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks.
Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.
arXiv Detail & Related papers (2020-06-15T23:00:08Z) - Fast and Robust Comparison of Probability Measures in Heterogeneous
Spaces [62.35667646858558]
We introduce the Anchor Energy (AE) and Anchor Wasserstein (AW) distances, which are respectively the energy and Wasserstein distances instantiated on such representations.
Our main contribution is to propose a sweep line algorithm to compute AE emphexactly in log-quadratic time, where a naive implementation would be cubic.
We show that AE and AW perform well in various experimental settings at a fraction of the computational cost of popular GW approximations.
arXiv Detail & Related papers (2020-02-05T03:09:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.