Clustering with UMAP: Why and How Connectivity Matters
- URL: http://arxiv.org/abs/2108.05525v1
- Date: Thu, 12 Aug 2021 04:25:03 GMT
- Title: Clustering with UMAP: Why and How Connectivity Matters
- Authors: Ayush Dalmia, Suzanna Sia
- Abstract summary: Topology based dimensionality reduction methods such as t-SNE and UMAP have seen increasing success and popularity in high-dimensional data.
We study the effects of node connectivity (k-Nearest Neighbors vs mutual k-Nearest Neighbors) and relative neighborhood (Adjacent via Path Neighbors) on dimensionality reduction.
- Score: 3.04585143845864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topology based dimensionality reduction methods such as t-SNE and UMAP have
seen increasing success and popularity in high-dimensional data. These methods
have strong mathematical foundations and are based on the intuition that the
topology in low dimensions should be close to that of high dimensions. Given
that the initial topological structure is a precursor to the success of the
algorithm, this naturally raises the question: What makes a "good" topological
structure for dimensionality reduction? Insight into this will enable us to
design better algorithms which take into account both local and global
structure. In this paper, which focuses on UMAP, we study the effects of node
connectivity (k-Nearest Neighbors vs mutual k-Nearest Neighbors) and
relative neighborhood (Adjacent via Path Neighbors) on dimensionality
reduction. We explore these concepts through extensive ablation studies on 4
standard image and text datasets (MNIST, FMNIST, 20NG, and AG), reducing to 2 and 64
dimensions. Our findings indicate that a more refined notion of connectivity
(mutual k-Nearest Neighbors with a minimum spanning tree) together with
a flexible method of constructing the local neighborhood (Path Neighbors), can
achieve a much better representation than default UMAP, as measured by
downstream clustering performance.
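The connectivity idea in the abstract can be illustrated with a short sketch: build a mutual k-NN graph (keep an edge only if both endpoints list each other as neighbors), then add minimum-spanning-tree edges so the graph stays connected. This is a minimal illustration of the general construction, not the authors' exact implementation; `mutual_knn_with_mst` is a hypothetical helper name.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics import pairwise_distances

def mutual_knn_with_mst(X, k=10):
    """Mutual k-NN graph augmented with minimum spanning tree edges.

    Sketch of the connectivity idea: keep an edge (i, j) only if i and j
    appear in each other's k-NN lists, then add MST edges over the full
    distance matrix so the graph has a single connected component.
    The paper's exact construction may differ.
    """
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # Mutual k-NN: elementwise minimum keeps a distance only if the
    # edge is present in both directions (min with 0 drops it).
    mutual = knn.minimum(knn.T)
    mutual.eliminate_zeros()
    # MST over the dense pairwise distances spans all points.
    mst = minimum_spanning_tree(pairwise_distances(X))
    return mutual.maximum(mst.maximum(mst.T))

# Two well-separated Gaussian clusters: mutual k-NN alone would leave
# them disconnected, but the MST edges bridge the gap.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(8, 1, (30, 5))])
g = mutual_knn_with_mst(X, k=5)
n_comp, _ = connected_components(g, directed=False)
print(n_comp)  # 1
```

The MST guarantees a single component even when the mutual filter prunes all edges between distant clusters, which is the failure mode that motivates combining the two.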
Related papers
- Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm [1.7484982792736636]
Dimensionality reduction techniques, particularly UMAP, are widely used for visualizing high-dimensional data.
We introduce Ollivier-Ricci curvature as a geometric prior, reinforcing edges at geometric bottlenecks and reducing redundant links.
Experiments on synthetic and real-world datasets show that JORC-UMAP reduces tearing and collapse more effectively than standard UMAP and other DR methods.
arXiv Detail & Related papers (2026-01-23T08:42:56Z) - Node Embeddings via Neighbor Embeddings [11.841966603069865]
We introduce graph t-SNE and graph CNE, a contrastive neighbor embedding method that produces high-dimensional node representations.
We show that both graph t-SNE and graph CNE strongly outperform state-of-the-art algorithms in terms of local structure preservation.
arXiv Detail & Related papers (2025-03-31T08:16:03Z) - A multi-core periphery perspective: Ranking via relative centrality [4.33459568143131]
Community and core-periphery are two widely studied graph structures.
The impact of inferring the core-periphery structure of a graph on understanding its community structure is not well utilized.
We introduce a novel model for graphs with ground truth communities, where each community has a densely connected part (the core) and the rest is sparser (the periphery).
arXiv Detail & Related papers (2024-06-06T20:21:27Z) - Data Topology-Dependent Upper Bounds of Neural Network Widths [52.58441144171022]
We first show that a three-layer neural network can be designed to approximate an indicator function over a compact set.
This is then extended to a simplicial complex, deriving width upper bounds based on its topological structure.
We prove the universal approximation property of three-layer ReLU networks using our topological approach.
arXiv Detail & Related papers (2023-05-25T14:17:15Z) - Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
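The merge rule described above can be sketched in a few lines: repeatedly merge the pair of clusters with the largest average pairwise dot product. Since the average dot product between two clusters equals the dot product of their centroids, each comparison is cheap. This brute-force version is illustrative only (`dot_product_agglomerative` is a hypothetical name), not the authors' implementation.

```python
import numpy as np

def dot_product_agglomerative(X):
    """Agglomerative clustering merging by *maximum average dot product*
    (instead of minimum distance or within-cluster variance).
    Returns the merge sequence as (cluster_a, cluster_b) id pairs;
    new clusters get ids n, n+1, ... as in scipy's linkage output."""
    clusters = {i: ([i], X[i].astype(float)) for i in range(len(X))}
    merges, next_id = [], len(X)
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        ids = list(clusters)
        for a_i, a in enumerate(ids):
            for b in ids[a_i + 1:]:
                # Average pairwise dot product = dot of the centroids.
                score = clusters[a][1] @ clusters[b][1]
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        (mem_a, _), (mem_b, _) = clusters.pop(a), clusters.pop(b)
        members = mem_a + mem_b
        clusters[next_id] = (members, X[members].mean(axis=0))
        merges.append((a, b))
        next_id += 1
    return merges

# Two pairs of near-duplicate directions merge within-pair first.
merges = dot_product_agglomerative(np.array([[1.0, 0.0], [0.9, 0.1],
                                             [0.0, 1.0], [0.1, 0.9]]))
print(merges)  # [(0, 1), (2, 3), (4, 5)]
```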
arXiv Detail & Related papers (2023-05-24T11:05:12Z) - Efficient Graph Field Integrators Meet Point Clouds [59.27295475120132]
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds.
The first class, SeparatorFactorization (SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion (RFD), uses popular $\epsilon$-nearest-neighbor graph representations for point clouds.
arXiv Detail & Related papers (2023-02-02T08:33:36Z) - Neighborhood Homophily-based Graph Convolutional Network [4.511171093050241]
Graph neural networks (GNNs) have proven powerful in graph-oriented tasks.
Many real-world graphs are heterophilous, challenging the homophily assumption of classical GNNs.
Recent studies propose new metrics to characterize homophily, but rarely consider the correlation between the proposed metrics and models.
In this paper, we first design a new metric, Neighborhood Homophily (NH), to measure the label complexity or purity in node neighborhoods.
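A neighborhood label-purity score of this flavor can be sketched directly; note this is a simplified illustration of the idea, not the paper's exact NH definition, and `neighborhood_purity` is a hypothetical name.

```python
from collections import Counter

def neighborhood_purity(adj, labels):
    """Illustrative neighborhood label-purity score (a simplification;
    the paper's NH metric may be defined differently): for each node,
    take the fraction of its neighbors that carry the most common
    neighbor label, then average over all nodes with neighbors."""
    scores = []
    for node, neighbors in adj.items():
        if not neighbors:
            continue  # isolated nodes contribute no score
        counts = Counter(labels[n] for n in neighbors)
        scores.append(counts.most_common(1)[0][1] / len(neighbors))
    return sum(scores) / len(scores)

# Mostly homophilous toy graph: triangle of label "a" with a "b" pendant.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "a", 3: "b"}
print(neighborhood_purity(adj, labels))  # 11/12 ≈ 0.917
```

A score near 1 indicates nearly pure neighborhoods (homophily); heterophilous graphs push the average down.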
arXiv Detail & Related papers (2023-01-24T07:56:44Z) - Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with
Coherent Embeddings [1.7188280334580195]
This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE to embed new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces.
arXiv Detail & Related papers (2021-09-22T06:45:37Z) - Exact and Approximate Hierarchical Clustering Using A* [51.187990314731344]
We introduce a new approach based on A* search for clustering.
We overcome the prohibitively large search space by combining A* with a novel trellis data structure.
We empirically demonstrate that our method achieves substantially higher quality results than baselines for a particle physics use case and other clustering benchmarks.
arXiv Detail & Related papers (2021-04-14T18:15:27Z) - Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on
Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z) - Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action
Recognition [57.98278794950759]
Graph Convolutional Networks (GCNs) have already demonstrated their powerful ability to model the irregular data.
We present a novel spatial-temporal GCN architecture which is defined via the Poincaré geometry.
We evaluate our method on two current largest scale 3D datasets.
arXiv Detail & Related papers (2020-07-30T18:23:18Z) - Neighborhood Matching Network for Entity Alignment [71.24217694278616]
Neighborhood Matching Network (NMN) is a novel entity alignment framework.
NMN estimates the similarities between entities to capture both the topological structure and the neighborhood difference.
It first uses a novel graph sampling method to distill a discriminative neighborhood for each entity.
It then adopts a cross-graph neighborhood matching module to jointly encode the neighborhood difference for a given entity pair.
arXiv Detail & Related papers (2020-05-12T08:26:15Z) - Data Structures & Algorithms for Exact Inference in Hierarchical
Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportionally to the size of the powerset of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)
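The $(2N-3)!!$ count of hierarchies mentioned in the last entry is easy to verify numerically, which also makes the super-exponential gap against the $2^N$ subsets concrete (the function name below is illustrative):

```python
def num_binary_hierarchies(n):
    """Number of distinct rooted binary hierarchies (dendrograms) over
    n labeled leaves: (2n-3)!! = 1 * 3 * 5 * ... * (2n-3), for n >= 2."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

# Hierarchies vs subsets: (2n-3)!! dwarfs 2^n as n grows, which is why
# a trellis DP over subsets beats enumerating every tree.
for n in (4, 10, 20):
    print(n, num_binary_hierarchies(n), 2 ** n)
```

For example, at $N = 10$ there are already 34,459,425 hierarchies but only 1,024 subsets.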
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.