Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with
Coherent Embeddings
- URL: http://arxiv.org/abs/2109.10538v1
- Date: Wed, 22 Sep 2021 06:45:37 GMT
- Title: Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with
Coherent Embeddings
- Authors: Gaëlle Candel, David Naccache
- Abstract summary: This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE to embed new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces.
- Score: 1.7188280334580195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $t$-SNE is an embedding method that the data science community has
widely adopted. Two interesting characteristics of $t$-SNE are its structure
preservation property and its answer to the crowding problem, whereby all
neighbors in high-dimensional space cannot be represented correctly in
low-dimensional space. $t$-SNE preserves the local neighborhood, and similar
items are nicely spaced by adjusting to the local density. These two
characteristics produce a meaningful representation, where the area of a
cluster is proportional to the number of items it contains, and relationships
between clusters are materialized by closeness on the embedding.
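To ground these two properties, here is a minimal NumPy sketch (not code from
the paper) of the standard $t$-SNE ingredients the abstract refers to: the
heavy-tailed Student-t similarities in the low-dimensional space, whose slow
decay is what counteracts crowding, and the KL cost that enforces structure
preservation. Function names are illustrative.

```python
import numpy as np

def student_t_similarities(Y):
    """Low-dimensional affinities q_ij from the Student-t kernel with one
    degree of freedom, as in standard t-SNE; the heavy tail gives similar
    items room in the embedding, mitigating the crowding problem."""
    # Pairwise squared Euclidean distances between embedding points.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)  # q_ii is defined to be zero
    return num / num.sum()

def kl_cost(P, Q, eps=1e-12):
    """t-SNE objective: KL divergence between the high-dimensional
    affinities P and the low-dimensional affinities Q; minimizing it
    preserves the local neighborhood structure."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))
```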
This algorithm is non-parametric; therefore, two initializations of the
algorithm lead to two different embeddings. In a forensic approach, analysts
would like to compare two or more datasets using their embeddings. One
approach would be to learn a parametric model over an embedding built with a
subset of the data. While this approach is highly scalable, points could be
mapped to the exact same position, making them indistinguishable. This type of
model would also be unable to adapt to new outliers or to concept drift.
This paper presents a methodology to reuse an embedding to create a new one in
which cluster positions are preserved. The optimization process minimizes two
costs, one relative to the embedding's shape and the second relative to the
match with the support embedding. The proposed algorithm has the same
complexity as the original $t$-SNE to embed new items, and a lower one when
considering the embedding of a dataset sliced into sub-pieces. The method
showed promising results on a real-world dataset, allowing one to observe the
birth, evolution, and death of clusters. The proposed approach facilitates the
identification of significant trends and changes, which empowers the
monitoring of high-dimensional datasets' dynamics.
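No reference implementation is given here, but the two-cost optimization can
be sketched. The hypothetical gradient step below combines the usual $t$-SNE
KL gradient (the embedding-shape cost) with a quadratic attachment to the
positions of points already present in the support embedding (the match
cost). The names `anchors`, `anchor_idx`, `lam`, and `lr` are illustrative
assumptions, not the authors' notation or values.

```python
import numpy as np

def index_tsne_step(Y, P, anchors, anchor_idx, lam=0.1, lr=100.0):
    """One hypothetical gradient step on
        KL(P || Q) + lam * sum_i ||y_i - a_i||^2,
    where a_i is the position, in the support (reference) embedding,
    of a point the two embeddings share (a sketch only; `lam` and `lr`
    are illustrative, not values from the paper)."""
    # Student-t low-dimensional affinities, as in standard t-SNE.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()

    # Standard t-SNE gradient:
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)
    W = (P - Q) * num
    grad = 4.0 * ((np.diag(W.sum(axis=1)) - W) @ Y)

    # Attachment cost pulls shared points toward their old positions,
    # keeping cluster locations coherent across successive embeddings.
    grad[anchor_idx] += 2.0 * lam * (Y[anchor_idx] - anchors)

    return Y - lr * grad
```

In this sketch, points absent from the support embedding are driven by the KL
term alone, which is consistent with the abstract's claim that embedding new
items costs no more than the original $t$-SNE.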
Related papers
- Adaptive $k$-nearest neighbor classifier based on the local estimation of the shape operator [49.87315310656657]
We introduce a new adaptive $k$-nearest neighbours ($kK$-NN) algorithm that explores the local curvature at a sample to adaptively define the neighborhood size.
Results on many real-world datasets indicate that the new $kK$-NN algorithm yields superior balanced accuracy compared to the established $k$-NN method.
arXiv Detail & Related papers (2024-09-08T13:08:45Z) - CBMAP: Clustering-based manifold approximation and projection for dimensionality reduction [0.0]
Dimensionality reduction methods are employed to decrease data dimensionality.
This study introduces a clustering-based approach, namely CBMAP, for dimensionality reduction.
CBMAP aims to preserve both global and local structures, ensuring that clusters in lower-dimensional spaces closely resemble those in high-dimensional spaces.
arXiv Detail & Related papers (2024-04-27T15:44:21Z) - Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from $O(n^2)$ to $O(kn)$ with $k \ll n$.
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Hierarchical Nearest Neighbor Graph Embedding for Efficient
Dimensionality Reduction [25.67957712837716]
We introduce a novel method based on a hierarchy built on 1-nearest neighbor graphs in the original space.
The proposal is an optimization-free projection that is competitive with the latest versions of $t$-SNE and UMAP.
In the paper, we argue about the soundness of the proposed method and evaluate it on a diverse collection of datasets with sizes varying from 1K to 11M samples and dimensions from 28 to 16K.
arXiv Detail & Related papers (2022-03-24T11:41:16Z) - BikNN: Anomaly Estimation in Bilateral Domains with k-Nearest Neighbors [1.2183405753834562]
A novel framework for anomaly estimation is proposed in this paper.
We attempt to estimate the degree of anomaly in both spatial and density domains.
Our method takes into account both the spatial domain and the density domain and can be adapted to different datasets by adjusting a few parameters manually.
arXiv Detail & Related papers (2021-05-11T13:45:29Z) - A Local Similarity-Preserving Framework for Nonlinear Dimensionality
Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering with eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under the statistical hypothesis test.
arXiv Detail & Related papers (2021-03-10T23:10:47Z) - Kernel Two-Dimensional Ridge Regression for Subspace Clustering [45.651770340521786]
We propose a novel subspace clustering method for 2D data.
It directly uses 2D data as inputs such that the learning of representations benefits from inherent structures and relationships of the data.
arXiv Detail & Related papers (2020-11-03T04:52:46Z) - Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z) - Stochastic Sparse Subspace Clustering [20.30051592270384]
State-of-the-art subspace clustering methods are based on self-expressive model, which represents each data point as a linear combination of other data points.
We introduce dropout, based on randomly dropping out data points, to address the issue of over-segmentation.
This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering.
arXiv Detail & Related papers (2020-05-04T13:09:17Z)