Interpreting the Curse of Dimensionality from Distance Concentration and
Manifold Effect
- URL: http://arxiv.org/abs/2401.00422v2
- Date: Sun, 7 Jan 2024 14:51:46 GMT
- Title: Interpreting the Curse of Dimensionality from Distance Concentration and
Manifold Effect
- Authors: Dehua Peng, Zhipeng Gui, Huayi Wu
- Abstract summary: We first summarize five challenges associated with manipulating high-dimensional data.
We then delve into two major causes of the curse of dimensionality: distance concentration and the manifold effect.
By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms.
- Score: 0.6906005491572401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The characteristics of data, such as distribution and heterogeneity, become more complex and counterintuitive as the dimensionality increases. This phenomenon is known as the curse of dimensionality: common patterns and relationships (e.g., internal and boundary patterns) that hold in low-dimensional space may be invalid in higher-dimensional space, degrading the performance of regression, classification, and clustering models and algorithms. The curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data and explain the potential causes of failure in regression, classification, and clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and the manifold effect, through theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measures, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms and improve the performance of data analysis and machine learning tasks in high-dimensional space.
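Both empirical findings are easy to reproduce. The following minimal numpy sketch (an illustration, not the authors' code) measures the relative contrast (d_max - d_min)/d_min of NNS under Minkowski (p = 2), Chebyshev, and cosine distances on uniform random data, then checks the PCA variance skew on data sampled near a low-dimensional manifold; the sample sizes and the 5-dimensional latent space are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dists):
    """(d_max - d_min) / d_min; distance concentration drives this
    ratio toward 0 as dimensionality grows, making NNS meaningless."""
    return (dists.max() - dists.min()) / dists.min()

def minkowski(X, q, p=2):
    return np.linalg.norm(X - q, ord=p, axis=1)

def chebyshev(X, q):
    return np.abs(X - q).max(axis=1)

def cosine(X, q):
    sim = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    return 1.0 - sim

for dim in (2, 10, 100, 1000):
    X, q = rng.random((2000, dim)), rng.random(dim)
    print(f"dim={dim:5d}",
          f"minkowski={relative_contrast(minkowski(X, q)):8.2f}",
          f"chebyshev={relative_contrast(chebyshev(X, q)):8.2f}",
          f"cosine={relative_contrast(cosine(X, q)):8.2f}")

# Manifold effect: points generated from a 5-D latent space embedded in
# 500-D ambient space concentrate nearly all PCA variance in 5 components.
n, latent_dim, ambient_dim = 2000, 5, 500
Z = rng.normal(size=(n, latent_dim))
A = rng.normal(size=(latent_dim, ambient_dim))
X = Z @ A + 0.01 * rng.normal(size=(n, ambient_dim))
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
var_ratio = s**2 / (s**2).sum()
print("variance share of top 5 components:", var_ratio[:5].sum())  # ~1.0
```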
Related papers
- Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space [11.006299554632461]
"Dense high dimensional vectors" are becoming increasingly vital in fields such as computer vision, machine learning, and large language models (LLMs)
Despite the nearest neighbor search (NNS) over these dense high dimensional vectors have been widely used for retrieval augmented generation (RAG) and many other applications.
Our study shows the effectiveness of the embedding-based data representation method and can offer opportunity for further optimization of dense vector-related applications.
arXiv Detail & Related papers (2024-10-08T07:28:17Z)
- StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset [56.71580976007712]
We propose to use the Human-Object Offset between anchors, densely sampled from the surfaces of the human mesh and object mesh, to represent the human-object spatial relation.
Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image.
During the optimization stage, we fine-tune the human body pose and object 6D pose by maximizing the likelihood of samples.
arXiv Detail & Related papers (2024-07-30T04:57:21Z)
- Surpassing Cosine Similarity for Multidimensional Comparisons: Dimension Insensitive Euclidean Metric (DIEM) [3.812115031347965]
We introduce the Dimension Insensitive Euclidean Metric (DIEM), which demonstrates superior robustness and generalizability across dimensions.
DIEM maintains consistent variability and eliminates the biases observed in traditional metrics, making it a reliable tool for high-dimensional comparisons.
This novel metric has the potential to replace cosine similarity, providing a more accurate and insightful method to analyze multidimensional data in fields ranging from neuromotor control to machine and deep learning.
arXiv Detail & Related papers (2024-07-11T16:00:22Z)
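DIEM's exact definition is given in that paper; the sketch below is only an illustration of the general recipe it motivates, namely removing the dimension-dependent drift of Euclidean distance by centering and scaling it with its empirical moments under a reference distribution (here uniform on the unit hypercube, an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def euclid_moments(dim, n_pairs=20000):
    """Empirical mean/std of the Euclidean distance between random pairs
    drawn uniformly from the unit hypercube (the reference distribution)."""
    a = rng.uniform(size=(n_pairs, dim))
    b = rng.uniform(size=(n_pairs, dim))
    d = np.linalg.norm(a - b, axis=1)
    return d.mean(), d.std()

def standardized_distance(x, y, mu, sigma):
    """Euclidean distance centered and scaled by the reference moments,
    so its magnitude no longer drifts upward with dimensionality."""
    return (np.linalg.norm(x - y) - mu) / sigma

for dim in (10, 100, 1000):
    mu, sigma = euclid_moments(dim)
    x, y = rng.uniform(size=dim), rng.uniform(size=dim)
    print(dim, round(standardized_distance(x, y, mu, sigma), 3))
```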
- Enhancing Dimension-Reduced Scatter Plots with Class and Feature Centroids [0.0]
When datasets are reduced to two dimensions, each observation is assigned x and y coordinates and is represented as a point on a scatter plot.
A significant challenge lies in interpreting the meaning of the x and y axes due to the complexities inherent in dimension reduction.
This study addresses this challenge by using the x and y coordinates derived from dimension reduction to calculate class and feature centroids, which can be overlaid onto the scatter plots.
arXiv Detail & Related papers (2024-03-29T15:45:25Z)
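As a minimal sketch of the class-centroid overlay (synthetic 2-D coordinates stand in for real dimension-reduction output; the paper's exact weighting for feature centroids may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Stand-ins for real data: coords = (n, 2) output of any dimension
# reduction (PCA, t-SNE, UMAP, ...), labels = (n,) class assignments.
centers = np.array([[0.0, 0.0], [5.0, 1.0], [2.0, 6.0]])
labels = rng.integers(0, 3, size=300)
coords = centers[labels] + rng.normal(scale=0.8, size=(300, 2))

fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, s=12, alpha=0.4)

# Class centroid = mean x/y of each class in the reduced space.
for k in np.unique(labels):
    cx, cy = coords[labels == k].mean(axis=0)
    ax.scatter(cx, cy, marker="X", s=200, edgecolors="black",
               label=f"class {k} centroid")

# A feature centroid can be overlaid analogously as a feature-value-weighted
# mean of the coordinates (one centroid per original feature).
ax.legend()
plt.show()
```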
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
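The paper's distributional reduction solves a joint (relaxed) optimization; the sketch below only illustrates the underlying Gromov-Wasserstein coupling between a dataset and a small set of prototypes using the POT library, with fixed rather than jointly optimized prototypes:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(3)

X = rng.normal(size=(200, 50))    # high-dimensional input data (toy stand-in)
Y = rng.normal(size=(10, 2))      # fixed low-dimensional prototypes

# Within-space cost matrices (squared Euclidean), rescaled to comparable range.
C1 = ot.dist(X, X); C1 /= C1.max()
C2 = ot.dist(Y, Y); C2 /= C2.max()
p, q = ot.unif(len(X)), ot.unif(len(Y))

# GW coupling between the two metric spaces; each point's highest-mass
# prototype gives a clustering, and the prototypes' 2-D coordinates act
# as a reduced embedding -- DR and clustering from one coupling.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
assignments = T.argmax(axis=1)
print(np.bincount(assignments, minlength=len(Y)))
```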
- Scaling Riemannian Diffusion Models [68.52820280448991]
We show that our method enables us to scale to high-dimensional tasks on nontrivial manifolds.
We model QCD densities on $SU(n)$ lattices and contrastively learned embeddings on high-dimensional hyperspheres.
arXiv Detail & Related papers (2023-10-30T21:27:53Z)
- Interpretable Linear Dimensionality Reduction based on Bias-Variance Analysis [45.3190496371625]
We propose a principled dimensionality reduction approach that maintains the interpretability of the resulting features.
In this way, all features are considered, the dimensionality is reduced, and interpretability is preserved.
arXiv Detail & Related papers (2023-03-26T14:30:38Z)
- Manifold Topology Divergence: a Framework for Comparing Data Manifolds [109.0784952256104]
We develop a framework for comparing data manifolds, aimed at the evaluation of deep generative models.
Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence).
We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance.
arXiv Detail & Related papers (2021-06-08T00:30:43Z)
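The Cross-Barcode construction is specific to that paper; as a simpler sketch in the same spirit, one can compare two point clouds' persistence diagrams with the ripser and persim packages (the circle data and the bottleneck-distance choice are illustrative, not the MTop-Divergence itself):

```python
import numpy as np
from ripser import ripser          # pip install ripser
from persim import bottleneck      # pip install persim

rng = np.random.default_rng(5)

def circle_cloud(n, noise):
    t = rng.uniform(0, 2 * np.pi, n)
    return np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=noise, size=(n, 2))

P = circle_cloud(200, 0.05)        # "real data" manifold
Q = circle_cloud(200, 0.30)        # "generated data", topology degraded

# 1-dimensional persistence diagrams capture the loop structure.
dgm_P = ripser(P)['dgms'][1]
dgm_Q = ripser(Q)['dgms'][1]

# Bottleneck distance between diagrams: larger = more topological change.
print("bottleneck distance (H1):", bottleneck(dgm_P, dgm_Q))
```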
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general-purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of the data matrix and define the contexts of data points.
Experiments on data classification and clustering over eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
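Only the first step of the Vec2vec pipeline is shown below: a hedged sketch of building the neighborhood similarity graph and reading off each point's "context" with scikit-learn; the similarity-preserving network training itself is omitted:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 100))   # rows of the input matrix = data points

# Neighborhood similarity graph: each point is linked to its k nearest
# neighbors, with distances as edge weights.
G = kneighbors_graph(X, n_neighbors=10, mode="distance", include_self=False)

# The "context" of a data point, word2vec-style, is read off the graph as
# its neighbor set; these pairs would supervise the embedding network.
for i in range(3):
    print(f"context of point {i}:", G[i].indices.tolist())
```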
- Multi-point dimensionality reduction to improve projection layout reliability [77.34726150561087]
In ordinary Dimensionality Reduction (DR), each data instance in an m-dimensional space (the original space) is mapped to one point in a d-dimensional space (the visual space).
Our solution, named Red Gray Plus, is built upon and extends a combination of ordinary DR and graph drawing techniques.
arXiv Detail & Related papers (2021-01-15T17:17:02Z)
- Unsupervised Discretization by Two-dimensional MDL-based Histogram [0.0]
Unsupervised discretization is a crucial step in many knowledge discovery tasks.
We propose an expressive model class that allows for far more flexible partitions of two-dimensional data.
We introduce an algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions.
arXiv Detail & Related papers (2020-06-02T19:19:49Z)