Interpreting the Curse of Dimensionality from Distance Concentration and
Manifold Effect
- URL: http://arxiv.org/abs/2401.00422v2
- Date: Sun, 7 Jan 2024 14:51:46 GMT
- Title: Interpreting the Curse of Dimensionality from Distance Concentration and
Manifold Effect
- Authors: Dehua Peng, Zhipeng Gui, Huayi Wu
- Abstract summary: We first summarize five challenges associated with manipulating high-dimensional data.
We then delve into two major causes of the curse of dimensionality, distance concentration and manifold effect.
By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The characteristics of data, such as distribution and heterogeneity, become
more complex and counterintuitive as the dimensionality increases. This phenomenon
is known as the curse of dimensionality, where common patterns and relationships
(e.g., internal and boundary patterns) that hold in low-dimensional space may be
invalid in higher-dimensional space. It leads to degraded performance of
regression, classification, and clustering models and algorithms. The curse of
dimensionality can be attributed to many causes. In this paper, we first
summarize five challenges associated with manipulating high-dimensional data
and explain the potential causes of failure in regression, classification,
and clustering tasks. Subsequently, we delve into two major causes of the curse
of dimensionality, distance concentration and the manifold effect, by performing
theoretical and empirical analyses. The results demonstrate that nearest
neighbor search (NNS) using three typical distance measures, Minkowski
distance, Chebyshev distance, and cosine distance, becomes meaningless as the
dimensionality increases. Meanwhile, the data incorporate more redundant
features, and the variance contribution of principal component analysis (PCA)
is skewed towards a few dimensions. By interpreting the causes of the curse of
dimensionality, we can better understand the limitations of current models and
algorithms, and work to improve the performance of data analysis and machine
learning tasks in high-dimensional space.
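The distance-concentration effect summarized in this abstract can be reproduced with a short numerical experiment. The sketch below is illustrative only (the uniform sampling, sample size, and Euclidean distance are assumptions, not the paper's exact setup): it measures the relative contrast (max - min) / min of distances from a query point to a random sample, which shrinks toward zero as the dimensionality grows.

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """Relative contrast (d_max - d_min) / d_min of Euclidean distances
    from one query point to n_points uniform samples in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    query = rng.random(dim)
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Contrast collapses as dimensionality grows, so the "nearest"
# neighbor is barely nearer than the farthest point.
for d in (2, 10, 100, 1000):
    print(f"dim={d:5d}  relative contrast={relative_contrast(d):.3f}")
```

With this setup the contrast typically drops by orders of magnitude between dim = 2 and dim = 1000, which is one concrete sense in which nearest neighbor search "becomes meaningless".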
Related papers
- Surpassing Cosine Similarity for Multidimensional Comparisons: Dimension Insensitive Euclidean Metric (DIEM) [3.812115031347965]
We introduce the Dimension Insensitive Euclidean Metric (DIEM), derived from the Euclidean distance.
DIEM maintains consistent variability and eliminates the biases observed in traditional metrics.
This novel metric has the potential to replace cosine similarity, providing a more accurate and insightful method to analyze multidimensional data.
arXiv Detail & Related papers (2024-07-11T16:00:22Z)
- Enhancing Dimension-Reduced Scatter Plots with Class and Feature Centroids [0.0]
When datasets are reduced to two dimensions, each observation is assigned x and y coordinates and is represented as a point on a scatter plot.
A significant challenge lies in interpreting the meaning of the x and y axes due to the complexities inherent in dimension reduction.
This study addresses this challenge by using the x and y coordinates derived from dimension reduction to calculate class and feature centroids, which can be overlaid onto the scatter plots.
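A minimal sketch of the centroid-overlay idea described above (illustrative assumptions: PCA via SVD as the reduction method and a synthetic two-class dataset, neither of which is specified in the summary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic labeled data: two Gaussian classes in 5-D.
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)),
               rng.normal(3.0, 1.0, (50, 5))])
labels = np.array([0] * 50 + [1] * 50)

# Reduce to 2-D with PCA (SVD of the centered data matrix).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T        # x/y coordinates for the scatter plot

# Class centroids: mean 2-D position of each class,
# ready to be overlaid onto the scatter plot.
centroids = {c: coords[labels == c].mean(axis=0) for c in (0, 1)}
for c, xy in centroids.items():
    print(f"class {c}: centroid at ({xy[0]:.2f}, {xy[1]:.2f})")
```

Feature centroids could be computed analogously by averaging the 2-D coordinates of the observations in which each feature is most expressed.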
arXiv Detail & Related papers (2024-03-29T15:45:25Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Scaling Riemannian Diffusion Models [68.52820280448991]
We show that our method enables us to scale to high-dimensional tasks on nontrivial manifolds.
We model QCD densities on $SU(n)$ lattices and contrastively learned embeddings on high-dimensional hyperspheres.
arXiv Detail & Related papers (2023-10-30T21:27:53Z)
- Interpretable Linear Dimensionality Reduction based on Bias-Variance Analysis [45.3190496371625]
We propose a principled dimensionality reduction approach that maintains the interpretability of the resulting features.
In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved.
arXiv Detail & Related papers (2023-03-26T14:30:38Z)
- In search of the most efficient and memory-saving visualization of high dimensional data [0.0]
We argue that the visualization of multidimensional data is well approximated by two-dimensional embeddings of undirected nearest-neighbor graphs.
Existing reduction methods are too slow and do not allow interactive manipulation.
We show that high-quality embeddings are produced with minimal time and memory complexity.
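The nearest-neighbor graphs at the heart of this approach are straightforward to construct; the sketch below is a brute-force illustration (the Euclidean distance and small random dataset are assumptions, and practical implementations rely on approximate neighbor search for speed):

```python
import numpy as np

def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph as a set of index pairs."""
    # Pairwise Euclidean distances (brute force, O(n^2) memory).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-matches
    nbrs = np.argsort(dists, axis=1)[:, :k]  # k nearest per point
    # Symmetrize: store each edge once as a sorted (i, j) pair.
    return {tuple(sorted((i, int(j))))
            for i in range(len(points)) for j in nbrs[i]}

rng = np.random.default_rng(0)
edges = knn_graph(rng.random((20, 3)), k=3)
print(len(edges), "edges")
```

A 2-D layout of such a graph (e.g., by force-directed drawing) is what the summary treats as an approximation of the full high-dimensional visualization.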
arXiv Detail & Related papers (2023-02-27T20:56:13Z)
- Towards a mathematical understanding of learning from few examples with nonlinear feature maps [68.8204255655161]
We consider the problem of data classification where the training set consists of just a few data points.
We reveal key relationships between the geometry of an AI model's feature space, the structure of the underlying data distributions, and the model's generalisation capabilities.
arXiv Detail & Related papers (2022-11-07T14:52:58Z)
- Manifold Topology Divergence: a Framework for Comparing Data Manifolds [109.0784952256104]
We develop a framework for comparing data manifolds, aimed at the evaluation of deep generative models.
Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence).
We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance.
arXiv Detail & Related papers (2021-06-08T00:30:43Z)
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering with eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis tests.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
- Multi-point dimensionality reduction to improve projection layout reliability [77.34726150561087]
In ordinary Dimensionality Reduction (DR), each data instance in an m-dimensional space (original space) is mapped to one point in a d-dimensional space (visual space).
Our solution, named Red Gray Plus, is built upon and extends a combination of ordinary DR and graph drawing techniques.
arXiv Detail & Related papers (2021-01-15T17:17:02Z)
- Unsupervised Discretization by Two-dimensional MDL-based Histogram [0.0]
Unsupervised discretization is a crucial step in many knowledge discovery tasks.
We propose an expressive model class that allows for far more flexible partitions of two-dimensional data.
We introduce an algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions.
arXiv Detail & Related papers (2020-06-02T19:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.