Depth-based pseudo-metrics between probability distributions
- URL: http://arxiv.org/abs/2103.12711v1
- Date: Tue, 23 Mar 2021 17:33:18 GMT
- Title: Depth-based pseudo-metrics between probability distributions
- Authors: Guillaume Staerman, Pavlo Mozharovskyi, Stéphan Clémençon and
Florence d'Alché-Buc
- Abstract summary: We propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions.
In contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality.
The regions-based pseudo-metric appears to be robust w.r.t. both outliers and heavy tails.
- Score: 1.1470070927586016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data depth is a nonparametric statistical tool that measures centrality of
any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability
distribution or a data set. It is a natural median-oriented extension of the
cumulative distribution function (cdf) to the multivariate case. Consequently,
its upper level sets -- the depth-trimmed regions -- give rise to a definition
of multivariate quantiles. In this work, we propose two new pseudo-metrics
between continuous probability measures based on data depth and its associated
central regions. The first is constructed as the $L_p$-distance between the
depth functions w.r.t. each distribution, while the second relies on the
Hausdorff distance between their quantile regions. This construction can
further be seen as an original way to extend the one-dimensional formulae of
the Wasserstein distance, which involve quantiles and cdfs, to the
multivariate setting. After discussing the
properties of these pseudo-metrics and providing conditions under which they
define a distance, we highlight similarities with the Wasserstein distance.
Interestingly, the derived non-asymptotic bounds show that in contrast to the
Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse
of dimensionality. Moreover, based on the support function of a convex body, we
propose an efficient approximation possessing linear time complexity w.r.t. the
size of the data set and its dimension. The quality of this approximation as
well as the performance of the proposed approach are illustrated in
experiments. Furthermore, by construction, the regions-based pseudo-metric
appears to be robust w.r.t. both outliers and heavy tails, a behavior witnessed
in the numerical experiments.
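As a rough illustration of the first construction, a plausible form consistent
with the abstract is $d_p(\mu,\nu)=\big(\int |D(x,\mu)-D(x,\nu)|^p\,dQ(x)\big)^{1/p}$,
where $D(\cdot,\mu)$ denotes the depth function w.r.t. $\mu$ and $Q$ is an
integrating measure. The sketch below is a minimal Monte Carlo version of this
idea for empirical samples; the random-projection approximation of the
halfspace (Tukey) depth, the choice of $Q$ as the pooled empirical measure, and
all function names are illustrative assumptions, not the authors'
implementation.

```python
import numpy as np

def approx_halfspace_depth(points, sample, n_dirs=100, rng=None):
    """Approximate the halfspace (Tukey) depth of each row of `points`
    w.r.t. the empirical distribution of `sample`: for every random unit
    direction, take the smaller one-sided empirical mass, then minimize
    over directions (a standard projection-based approximation)."""
    rng = np.random.default_rng(rng)
    dirs = rng.standard_normal((n_dirs, sample.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj_sample = sample @ dirs.T                    # shape (n, n_dirs)
    proj_points = points @ dirs.T                    # shape (m, n_dirs)
    mass = (proj_sample[None, :, :] <= proj_points[:, None, :]).mean(axis=1)
    return np.minimum(mass, 1.0 - mass).min(axis=1)

def depth_pseudo_metric(X, Y, p=2, n_ref=500, rng=None):
    """Monte Carlo L_p distance between the depth functions w.r.t. X and Y,
    evaluated at reference points resampled from the pooled data (an
    illustrative stand-in for the integrating measure Q)."""
    rng = np.random.default_rng(rng)
    pooled = np.vstack([X, Y])
    ref = pooled[rng.integers(len(pooled), size=n_ref)]
    dX = approx_halfspace_depth(ref, X, rng=rng)
    dY = approx_halfspace_depth(ref, Y, rng=rng)
    return (np.abs(dX - dY) ** p).mean() ** (1.0 / p)

# Example: two Gaussian samples whose means differ.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
Y = rng.standard_normal((500, 3)) + 0.5
print(depth_pseudo_metric(X, Y, rng=rng))
```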
Related papers
- Fast kernel half-space depth for data with non-convex supports [5.725360029813277]
We extend the celebrated halfspace depth to handle multimodal distributions.
The proposed depth can be computed using manifold gradients, making it faster than the halfspace depth by several orders of magnitude.
The performance of our depth is demonstrated through numerical simulations as well as applications such as anomaly detection on real data and homogeneity testing.
arXiv Detail & Related papers (2023-12-21T18:55:22Z)
- Computing the Distance between unbalanced Distributions -- The flat Metric [0.0]
The flat metric generalizes the well-known Wasserstein distance W1 to the case where the distributions have unequal total mass.
The core of the method is based on a neural network that determines an optimal test function realizing the distance between two measures.
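For context, the flat metric (also known as the bounded-Lipschitz distance)
admits a standard dual formulation over test functions,
$\rho_F(\mu,\nu)=\sup\{\int f\,d(\mu-\nu) : \|f\|_\infty\le 1,\ \mathrm{Lip}(f)\le 1\}$,
which is the quantity such a network is trained to approximate; this is the
textbook form, not necessarily that paper's notation. Dropping the sup-norm
constraint recovers the dual form of W1 for measures of equal total mass.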
arXiv Detail & Related papers (2023-08-02T09:30:22Z)
- Energy-Based Sliced Wasserstein Distance [47.18652387199418]
A key component of the sliced Wasserstein (SW) distance is the slicing distribution.
We propose to design the slicing distribution as an energy-based distribution that is parameter-free.
We then derive a novel sliced Wasserstein metric, the energy-based sliced Wasserstein (EBSW) distance.
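For background, the sketch below computes the ordinary sliced Wasserstein
distance with a uniform slicing distribution between two equal-size samples;
per the summary, EBSW replaces this uniform distribution over directions with
a parameter-free energy-based one, which is not reproduced here.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=200, p=2, rng=None):
    """Plain Monte Carlo sliced Wasserstein distance between equal-size
    samples (uniform slicing only; EBSW's energy-based slicing is not
    implemented here). Uses the closed-form 1D Wasserstein distance:
    an L_p distance between sorted projections."""
    rng = np.random.default_rng(rng)
    dirs = rng.standard_normal((n_projections, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px = np.sort(X @ dirs.T, axis=0)   # sorted 1D projections of X
    py = np.sort(Y @ dirs.T, axis=0)   # sorted 1D projections of Y
    return (np.abs(px - py) ** p).mean() ** (1.0 / p)
```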
arXiv Detail & Related papers (2023-04-26T14:28:45Z)
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
- Hilbert Curve Projection Distance for Distribution Comparison [34.8765820950517]
We propose a novel metric, called Hilbert curve projection (HCP) distance, to measure the distance between two probability distributions.
We show that HCP distance is a proper metric and is well-defined for probability measures with bounded supports.
Experiments on both synthetic and real-world data show that our HCP distance works as an effective surrogate of the Wasserstein distance with low complexity.
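The summary describes the construction only at a high level (project both
measures onto a Hilbert curve, then compare in one dimension). The sketch
below is a rough 2D-only illustration of that idea, not the paper's estimator:
the grid resolution, the coupling of sorted curve positions, and the
equal-sample-size requirement are all simplifying assumptions.

```python
import numpy as np

def hilbert_index(x, y, order):
    """Classic xy2d mapping: position of grid cell (x, y) along a Hilbert
    curve filling a 2**order x 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hcp_like_distance(X, Y, order=8, p=2):
    """Rough illustration for equal-size 2D samples: quantize to a grid,
    order each sample along the Hilbert curve, couple the order statistics,
    and average the resulting transport cost."""
    n = 1 << order
    pooled = np.vstack([X, Y])
    lo, hi = pooled.min(axis=0), pooled.max(axis=0)
    def curve_keys(Z):
        G = ((Z - lo) / (hi - lo + 1e-12) * (n - 1)).astype(int)
        return np.array([hilbert_index(gx, gy, order) for gx, gy in G])
    Xs = X[np.argsort(curve_keys(X))]
    Ys = Y[np.argsort(curve_keys(Y))]
    return (np.linalg.norm(Xs - Ys, axis=1) ** p).mean() ** (1.0 / p)
```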
arXiv Detail & Related papers (2022-05-30T12:40:32Z)
- Tangent Space and Dimension Estimation with the Wasserstein Distance [10.118241139691952]
Consider a set of points sampled independently near a smooth compact submanifold of Euclidean space.
We provide mathematically rigorous bounds on the number of sample points required to estimate both the dimension and the tangent spaces of that manifold.
arXiv Detail & Related papers (2021-10-12T21:02:06Z)
- Kernel distance measures for time series, random fields and other structured data [71.61147615789537]
kdiff is a novel kernel-based measure for estimating distances between instances of structured data.
It accounts for both self and cross similarities across the instances and is defined using a lower quantile of the distance distribution.
Some theoretical results are provided for separability conditions using kdiff as a distance measure for clustering and classification problems.
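The summary only states that kdiff contrasts self and cross similarities
through a lower quantile of distance distributions. One plausible reading,
offered purely for intuition and not as the paper's definition, is the
following quantile contrast.

```python
import numpy as np
from scipy.spatial.distance import cdist

def kdiff_like(A, B, alpha=0.1):
    """Illustrative quantile contrast (NOT the paper's definition of kdiff):
    a lower quantile of the cross-distance distribution, centered by the
    same quantile of the within-sample (self) distances."""
    cross = cdist(A, B).ravel()
    self_a = cdist(A, A)[np.triu_indices(len(A), k=1)]
    self_b = cdist(B, B)[np.triu_indices(len(B), k=1)]
    q = lambda v: np.quantile(v, alpha)
    return q(cross) - 0.5 * (q(self_a) + q(self_b))
```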
arXiv Detail & Related papers (2021-09-29T22:54:17Z)
- Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation [92.81218653234669]
We present a new approach to manifold hypothesis checking and underlying manifold dimension estimation.
Our geometrical method is a modification, for sparse data, of the well-known box-counting algorithm for Minkowski dimension calculation.
Experiments on real datasets show that the suggested approach, based on a combination of the two methods, is powerful and effective.
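Since the entry builds on the classical box-counting estimate of the Minkowski
dimension, here is a minimal version of that baseline; the paper's
modification for sparse data is not reproduced.

```python
import numpy as np

def box_counting_dimension(X, scales=(2, 4, 8, 16, 32, 64, 128)):
    """Classical box-counting baseline (the paper's sparse-data variant is
    not implemented): count occupied boxes N(s) at several grid resolutions
    s and fit the slope of log N(s) against log s."""
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    counts = []
    for s in scales:
        cells = np.clip((X * s).astype(int), 0, s - 1)
        counts.append(len({tuple(c) for c in cells}))
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

# Example: points on a circle embedded in R^3 have intrinsic dimension 1.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 2000)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
print(box_counting_dimension(X))  # roughly 1
```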
arXiv Detail & Related papers (2021-07-08T15:35:54Z)
- Two-sample Test using Projected Wasserstein Distance [18.46110328123008]
We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning.
A key contribution is to use optimal projection to find the low-dimensional linear mapping that maximizes the Wasserstein distance between the projected probability distributions.
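A minimal stand-in for this statistic, assuming one-dimensional projections
and a crude random search in place of the paper's optimization over
projections:

```python
import numpy as np

def max_projected_w1(X, Y, n_candidates=500, rng=None):
    """Crude stand-in for the projected Wasserstein statistic: the largest
    1D Wasserstein-1 distance over random unit directions (the paper
    optimizes the projection instead of random-searching it)."""
    rng = np.random.default_rng(rng)
    dirs = rng.standard_normal((n_candidates, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px = np.sort(X @ dirs.T, axis=0)
    py = np.sort(Y @ dirs.T, axis=0)
    # Per direction: 1D W1 between equal-size samples; report the maximum.
    return np.abs(px - py).mean(axis=0).max()
```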
arXiv Detail & Related papers (2020-10-22T18:08:58Z)
- On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification [101.0377583883137]
Projection robust (PR) OT seeks to maximize the OT cost between two measures by choosing a $k$-dimensional subspace onto which they can be projected.
Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances.
Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, by averaging rather than optimizing on subspaces.
arXiv Detail & Related papers (2020-06-22T14:35:33Z)
- Augmented Sliced Wasserstein Distances [55.028065567756066]
We propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs).
ASWDs are constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks.
Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.
arXiv Detail & Related papers (2020-06-15T23:00:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.