OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data
- URL: http://arxiv.org/abs/2408.10264v1
- Date: Thu, 15 Aug 2024 22:30:44 GMT
- Title: OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data
- Authors: Chengyu Gong, Gefei Shen, Luanzheng Guo, Nathan Tallent, Dongfang Zhao
- Abstract summary: One of the most common operations in multimodal scientific data management is searching for the $k$ most similar items.
The dimension of the resulting embedding vectors is usually on the order of hundreds or a thousand, which is impractically high for time-sensitive scientific applications.
This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors does not change in the lower-dimensional space.
- Score: 0.888375168590583
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: One of the most common operations in multimodal scientific data management is searching for the $k$ most similar items (or, $k$-nearest neighbors, KNN) from the database after being provided a new item. Although recent advances in multimodal machine learning models offer a \textit{semantic} index, the so-called \textit{embedding vectors} mapped from the original multimodal data, the dimension of the resulting embedding vectors is usually on the order of hundreds or a thousand, which is impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors does not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). In order to develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function may be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric spaces, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-form function into popular dimension-reduction methods, various distance metrics, and embedding models.
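As a rough illustration of the quantities described in the abstract (a minimal sketch, not the authors' code), the snippet below uses PCA as a stand-in for the dimension-reduction map and measures, for each vector, the fraction of its top-$k$ nearest neighbors that survive the reduction; the mean of this per-vector overlap plays the role of the aggregate accuracy over the metric space. The choices of PCA, $k = 10$, and the overlap ratio are illustrative assumptions, not the paper's exact definitions.

```python
# Minimal sketch (assumptions: PCA as the reduction map, neighbor-overlap ratio
# as the KNN similarity measure). Not the OPDR authors' implementation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_indices(X, k):
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]  # drop the self-match

def knn_overlap(X_high, X_low, k=10):
    # Per-vector KNN similarity: |top-k in original AND top-k after reduction| / k,
    # averaged over the dataset (an aggregate accuracy in the abstract's sense).
    idx_high = knn_indices(X_high, k)
    idx_low = knn_indices(X_low, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]
    return float(np.mean(overlaps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))  # stand-in for high-dimensional embedding vectors
    for d in (32, 64, 128, 256):
        X_low = PCA(n_components=d).fit_transform(X)
        print(f"d={d:4d}  mean top-k overlap = {knn_overlap(X, X_low):.3f}")
```

In the paper's terms, the goal is the smallest target dimensionality for which this aggregate overlap stays at (or near) 1, i.e., the top-$k$ neighbor sets are unchanged by the reduction.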
Related papers
- MPAD: A New Dimension-Reduction Method for Preserving Nearest Neighbors in High-Dimensional Vector Search [1.1701842638497677]
Dimensionality reduction (DR) is seldom applied due to its tendency to distort the nearest-neighbor structure critical for search.
We present MPAD: Maximum Pairwise Absolute Difference, an unsupervised DR method that explicitly preserves approximate NN relations.
Experiments across multiple domains show that MPAD consistently outperforms standard DR methods in preserving neighborhood structure.
arXiv Detail & Related papers (2025-04-23T00:59:00Z) - Finsler Multi-Dimensional Scaling: Manifold Learning for Asymmetric Dimensionality Reduction and Embedding [41.601022263772535]
Dimensionality reduction aims to simplify complex data by reducing its feature dimensionality while preserving essential patterns, with core applications in data analysis and visualisation.
To preserve the underlying data structure, multi-dimensional scaling (MDS) methods focus on preserving pairwise dissimilarities, such as distances; a minimal classical-MDS sketch appears after this list.
arXiv Detail & Related papers (2025-03-23T10:03:22Z) - Hyperboloid GPLVM for Discovering Continuous Hierarchies via Nonparametric Estimation [41.13597666007784]
Dimensionality reduction (DR) offers a useful representation of complex high-dimensional data.
Recent DR methods focus on hyperbolic geometry to derive a faithful low-dimensional representation of hierarchical data.
This paper presents hGP-LVMs to embed high-dimensional hierarchical data with implicit continuity via nonparametric estimation.
arXiv Detail & Related papers (2024-10-22T05:07:30Z) - Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z) - Scaling Riemannian Diffusion Models [68.52820280448991]
We show that our method enables us to scale to high-dimensional tasks on nontrivial manifolds.
We model QCD densities on $SU(n)$ lattices and contrastively learned embeddings on high dimensional hyperspheres.
arXiv Detail & Related papers (2023-10-30T21:27:53Z) - Canonical normalizing flows for manifold learning [14.377143992248222]
We propose a canonical manifold learning flow method, where a novel objective enforces the transformation matrix to have few prominent and non-degenerate basis functions.
Canonical manifold flow yields a more efficient use of the latent space, automatically generating fewer prominent and distinct dimensions to represent data.
arXiv Detail & Related papers (2023-10-19T13:48:05Z) - Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data-efficient and highly parallelizable operator learning approach with reduced memory requirements and better generalization.
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena.
We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
arXiv Detail & Related papers (2023-09-29T20:18:52Z) - Bayesian Hyperbolic Multidimensional Scaling [2.5944208050492183]
We propose a Bayesian approach to multidimensional scaling when the low-dimensional manifold is hyperbolic.
A case-control likelihood approximation allows for efficient sampling from the posterior distribution in larger data settings.
We evaluate the proposed method against state-of-the-art alternatives using simulations, canonical reference datasets, Indian village network data, and human gene expression data.
arXiv Detail & Related papers (2022-10-26T23:34:30Z) - ParaDime: A Framework for Parametric Dimensionality Reduction [4.928716468981609]
ParaDime is a framework for parametric dimensionality reduction (DR).
It unifies parametric versions of DR techniques such as metric MDS, t-SNE, and UMAP.
It allows users to fully customize all aspects of the DR process.
arXiv Detail & Related papers (2022-10-10T11:36:56Z) - Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z) - Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic
Approach to Manifold Dimension Estimation [92.81218653234669]
We present a new approach to manifold hypothesis checking and underlying manifold dimension estimation.
Our geometrical method modifies the well-known box-counting algorithm for Minkowski dimension calculation to handle sparse data.
Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
arXiv Detail & Related papers (2021-07-08T15:35:54Z) - Manifold Topology Divergence: a Framework for Comparing Data Manifolds [109.0784952256104]
We develop a framework for comparing data manifolds, aimed at the evaluation of deep generative models.
Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence).
We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance.
arXiv Detail & Related papers (2021-06-08T00:30:43Z) - Brain Image Synthesis with Unsupervised Multivariate Canonical
CSC$\ell_4$Net [122.8907826672382]
We propose to learn dedicated features that cross both inter- and intra-modal variations using a novel CSC$\ell_4$Net.
arXiv Detail & Related papers (2021-03-22T05:19:40Z) - A Local Similarity-Preserving Framework for Nonlinear Dimensionality
Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments of data classification and clustering on eight real datasets show that Vec2vec is better than several classical dimensionality reduction methods in the statistical hypothesis test.
arXiv Detail & Related papers (2021-03-10T23:10:47Z) - Perplexity-free Parametric t-SNE [11.970023029249083]
The t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm is a ubiquitously employed dimensionality reduction (DR) method.
It is however bounded to a user-defined perplexity parameter, restricting its DR quality compared to recently developed multi-scale perplexity-free approaches.
This paper hence proposes a multi-scale parametric t-SNE scheme, relieved from the perplexity tuning and with a deep neural network implementing the mapping.
arXiv Detail & Related papers (2020-10-03T13:47:01Z) - Manifold Learning via Manifold Deflation [105.7418091051558]
Dimensionality reduction methods provide a valuable means to visualize and interpret high-dimensional data.
Many popular methods can fail dramatically, even on simple two-dimensional manifolds.
This paper presents an embedding method for a novel, incremental tangent space estimator that incorporates global structure as coordinates.
Empirically, we show our algorithm recovers novel and interesting embeddings on real-world and synthetic datasets.
arXiv Detail & Related papers (2020-07-07T10:04:28Z) - Unsupervised Discretization by Two-dimensional MDL-based Histogram [0.0]
Unsupervised discretization is a crucial step in many knowledge discovery tasks.
We propose an expressive model class that allows for far more flexible partitions of two-dimensional data.
We introduce an algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions.
arXiv Detail & Related papers (2020-06-02T19:19:49Z) - Information-Theoretic Limits for the Matrix Tensor Product [8.206394018475708]
This paper studies a high-dimensional inference problem involving the matrix tensor product of random matrices.
On the technical side, this paper introduces some new techniques for the analysis of high-dimensional matrix-preserving signals.
arXiv Detail & Related papers (2020-05-22T17:03:48Z) - Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves the top-1 accuracy on the challenging COCO keypoint benchmark and the state-of-the-art results on the MPII datasets.
arXiv Detail & Related papers (2020-03-17T03:52:17Z) - Learning Flat Latent Manifolds with VAEs [16.725880610265378]
We propose an extension to the framework of variational auto-encoders, where the Euclidean metric is a proxy for the similarity between data points.
We replace the compact prior typically used in variational auto-encoders with a recently presented, more expressive hierarchical one.
We evaluate our method on a range of data-sets, including a video-tracking benchmark.
arXiv Detail & Related papers (2020-02-12T09:54:52Z)
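Several of the entries above (Finsler MDS, Bayesian hyperbolic MDS, ParaDime) extend multidimensional scaling, which, as noted in the Finsler MDS summary, embeds points so that pairwise dissimilarities are preserved. For background only (none of the listed papers' code), here is a minimal sketch of classical (Torgerson) MDS via double centering and eigendecomposition:

```python
# Minimal classical-MDS sketch for background; assumes Euclidean input distances.
import numpy as np

def classical_mds(X, n_components=2):
    """Embed points so that pairwise Euclidean distances are preserved
    as well as possible (classical / Torgerson MDS)."""
    n = X.shape[0]
    # Squared Euclidean distance matrix.
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Double centering: B = -1/2 * J D^2 J with J = I - (1/n) 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    # Top eigenpairs of B give the low-dimensional coordinates.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    Y = classical_mds(X, n_components=10)
    print(Y.shape)  # (200, 10)
```

The listed methods generalize different parts of this pipeline, e.g., the underlying geometry (hyperbolic or Finsler instead of Euclidean) or the map itself (a learned parametric model, as in ParaDime).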