Related papers: Wasserstein t-SNE

Wasserstein t-SNE

URL: http://arxiv.org/abs/2205.07531v1
Date: Mon, 16 May 2022 09:09:24 GMT
Title: Wasserstein t-SNE
Authors: Fynn Bachmann, Philipp Hennig, Dmitry Kobak
Abstract summary: We develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them.
Score: 25.241296604908424
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.

Related papers

Wasserstein Distances Made Explainable: Insights into Dataset Shifts and Transport Phenomena [3.4991519098475843]
Wasserstein distances provide a powerful framework for comparing data distributions.<n>We propose a novel solution based on Explainable AI that allows us to efficiently and accurately attribute Wasserstein distances to various data components.
arXiv Detail & Related papers (2025-05-09T15:26:38Z)
Sublinear Algorithms for Wasserstein and Total Variation Distances: Applications to Fairness and Privacy Auditing [7.81603404636933]
We propose a generic algorithmic framework to estimate the PDF and CDF of any sub-Gaussian distribution while the samples from them arrive in a stream. We compute mergeable summaries of distributions from the stream of samples that require sublinear space w.r.t. the number of observed samples. This allows us to estimate Wasserstein and Total Variation (TV) distances between any two sub-Gaussian while samples arrive in streams and from multiple sources.
arXiv Detail & Related papers (2025-03-10T18:57:48Z)
Fused Gromov-Wasserstein Variance Decomposition with Linear Optimal Transport [11.94799054956877]
We present a decomposition of Fr'echet variance of a set of measures in the 2-Wasserstein space, which allows one to compute the percentage of variance explained by LOT embeddings of those measures. We also present several experiments that explore the relationship between the dimension of the LOT embedding, the percentage of variance explained, and the classification accuracy of machine learning classifiers built on the embedded data.
arXiv Detail & Related papers (2024-11-15T14:10:52Z)
Federated Wasserstein Distance [16.892296712204597]
We introduce a principled way of computing the Wasserstein distance between two distributions in a federated manner. We show how to estimate the Wasserstein distance between two samples stored and kept on different devices/clients whilst a central entity/server orchestrates the computations.
arXiv Detail & Related papers (2023-10-03T11:30:50Z)
Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size. We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
Concrete Score Matching: Generalized Score Matching for Discrete Data [109.12439278055213]
"Concrete score" is a generalization of the (Stein) score for discrete settings. "Concrete Score Matching" is a framework to learn such scores from samples.
arXiv Detail & Related papers (2022-11-02T00:41:37Z)
Density Ratio Estimation via Infinitesimal Classification [85.08255198145304]
We propose DRE-infty, a divide-and-conquer approach to reduce Density ratio estimation (DRE) to a series of easier subproblems. Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions. We show that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.
arXiv Detail & Related papers (2021-11-22T06:26:29Z)
On the use of Wasserstein metric in topological clustering of distributional data [0.0]
This paper deals with a clustering algorithm for histogram data based on a Self-Organizing Map (SOM) learning. It combines a dimension reduction by SOM and the clustering of the data in a reduced space.
arXiv Detail & Related papers (2021-09-09T14:27:15Z)
Depth-based pseudo-metrics between probability distributions [1.1470070927586016]
We propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions. In contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality. The regions-based pseudo-metric appears to be robust w.r.t. both outliers and heavy tails.
arXiv Detail & Related papers (2021-03-23T17:33:18Z)
Learning High Dimensional Wasserstein Geodesics [55.086626708837635]
We propose a new formulation and learning strategy for computing the Wasserstein geodesic between two probability distributions in high dimensions. By applying the method of Lagrange multipliers to the dynamic formulation of the optimal transport (OT) problem, we derive a minimax problem whose saddle point is the Wasserstein geodesic. We then parametrize the functions by deep neural networks and design a sample based bidirectional learning algorithm for training.
arXiv Detail & Related papers (2021-02-05T04:25:28Z)
Two-sample Test using Projected Wasserstein Distance [18.46110328123008]
We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning. A key contribution is to couple optimal projection to find the low dimensional linear mapping to maximize the Wasserstein distance between projected probability distributions.
arXiv Detail & Related papers (2020-10-22T18:08:58Z)
Augmented Sliced Wasserstein Distances [55.028065567756066]
We propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs) ASWDs are constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.
arXiv Detail & Related papers (2020-06-15T23:00:08Z)
Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces [62.35667646858558]
We introduce the Anchor Energy (AE) and Anchor Wasserstein (AW) distances, which are respectively the energy and Wasserstein distances instantiated on such representations. Our main contribution is to propose a sweep line algorithm to compute AE emphexactly in log-quadratic time, where a naive implementation would be cubic. We show that AE and AW perform well in various experimental settings at a fraction of the computational cost of popular GW approximations.
arXiv Detail & Related papers (2020-02-05T03:09:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.