Sketch and Scale: Geo-distributed tSNE and UMAP
- URL: http://arxiv.org/abs/2011.06103v1
- Date: Wed, 11 Nov 2020 22:32:21 GMT
- Title: Sketch and Scale: Geo-distributed tSNE and UMAP
- Authors: Viska Wei, Nikita Ivkin, Vladimir Braverman, Alexander Szalay
- Abstract summary: Running machine learning analytics over geographically distributed datasets is a rapidly arising problem.
We introduce a novel framework: Sketch and Scale (SnS).
It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced size sketches on the master node, and runs vanilla tSNE or UMAP on the summary.
We show this technique to be fully parallel and to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe.
- Score: 75.44887265789056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Running machine learning analytics over geographically distributed datasets
is a rapidly arising problem in the world of data management policies ensuring
privacy and data security. Visualizing high dimensional data using tools such
as t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold
Approximation and Projection (UMAP) has become common practice for data scientists.
Both tools scale poorly in time and memory. While recent optimizations showed
successful handling of 10,000 data points, scaling beyond a million points is
still challenging. We introduce a novel framework: Sketch and Scale (SnS). It
leverages a Count Sketch data structure to compress the data on the edge nodes,
aggregates the reduced size sketches on the master node, and runs vanilla tSNE
or UMAP on the summary, representing the densest areas, extracted from the
aggregated sketch. We show this technique to be fully parallel and to scale linearly
in time and logarithmically in memory and communication, making it possible to
analyze datasets with many millions, potentially billions of data points,
spread across several data centers around the globe. We demonstrate the power
of our method on two mid-size datasets: cancer data with 52 million 35-band
pixels from multiple images of tumor biopsies; and astrophysics data of 100
million stars with multi-color photometry from the Sloan Digital Sky Survey
(SDSS).
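The sketch-then-aggregate step described in the abstract can be illustrated with a minimal Count Sketch in Python. This is a rough sketch under stated assumptions, not the authors' implementation: the class `CountSketch`, the helper `cell_of`, the grid-cell quantization of points, and all parameter values (`depth`, `width`, `cell_size`, the synthetic data) are hypothetical choices made for illustration. It shows the property the framework relies on: sketches built with the same hash functions are linear, so the master node can sum them and then query approximate counts for the densest cells.

```python
import hashlib
import random
from statistics import median


class CountSketch:
    """Minimal Count Sketch: `depth` rows of `width` signed counters."""

    def __init__(self, depth=5, width=256, seed=0):
        self.depth, self.width, self.seed = depth, width, seed
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, key, row):
        # Derive a bucket index and a +/-1 sign from a seeded hash of the key.
        data = f"{self.seed}:{row}:{key}".encode()
        h = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
        return (h >> 1) % self.width, (1 if h & 1 else -1)

    def update(self, key, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            self.table[row][bucket] += sign * count

    def estimate(self, key):
        # The median over rows damps the effect of hash collisions.
        values = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            values.append(sign * self.table[row][bucket])
        return median(values)

    def merge(self, other):
        # Sketches sharing depth, width, and seed are linear: add the tables.
        assert (self.depth, self.width, self.seed) == (
            other.depth, other.width, other.seed)
        for r in range(self.depth):
            for b in range(self.width):
                self.table[r][b] += other.table[r][b]
        return self


def cell_of(point, cell_size=1.0):
    # Assumed quantization: map a point to the grid cell that contains it.
    return tuple(int(x // cell_size) for x in point)


def make_node(rng, n_dense, n_noise):
    # Synthetic edge-node data: one dense cell at (3, 7) plus sparse noise.
    dense = [(3.0 + rng.uniform(0, 0.99), 7.0 + rng.uniform(0, 0.99))
             for _ in range(n_dense)]
    noise = [(rng.uniform(-50, 50), rng.uniform(-50, 50))
             for _ in range(n_noise)]
    return dense + noise


rng = random.Random(42)
node_a = make_node(rng, 600, 25)
node_b = make_node(rng, 400, 25)

# Each edge node compresses its own points into a fixed-size sketch.
sketch_a, sketch_b = CountSketch(), CountSketch()
for p in node_a:
    sketch_a.update(cell_of(p))
for p in node_b:
    sketch_b.update(cell_of(p))

# The master node only receives the sketches and sums them.
merged = CountSketch().merge(sketch_a).merge(sketch_b)
print("approximate count in cell (3, 7):", merged.estimate((3, 7)))
```

The sketch size depends only on `depth` and `width`, not on the number of points, so each node sends the same amount of data regardless of how many points it holds. In a fuller pipeline the densest cells recovered this way would form the summary passed to tSNE or UMAP; tracking which cells to query is left out here for brevity.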
Related papers
- Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting [64.7364925689825]
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain.
The Lidar dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose.
The Motion Forecasting dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene.
arXiv Detail & Related papers (2023-01-02T00:36:22Z)
- Study of Manifold Geometry using Multiscale Non-Negative Kernel Graphs [32.40622753355266]
We propose a framework to study the geometric structure of the data.
We make use of our recently introduced non-negative kernel (NNK) regression graphs to estimate the point density, intrinsic dimension, and the linearity of the data manifold (curvature).
arXiv Detail & Related papers (2022-10-31T17:01:17Z)
- PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking? [62.997667081978825]
We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges.
This allows our graph neural network to learn to effectively encode temporal and spatial interactions.
We establish a new state of the art on the nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations.
arXiv Detail & Related papers (2022-08-03T10:06:56Z)
- SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP [3.7731754155538164]
This work introduces a force directed approach to multidimensional scaling with a time and space complexity of O(N) with N data points.
The method can be combined with force directed layouts of the family of neighbour embedding such as t-SNE, to produce embeddings that preserve both the global and the local structures of the data.
arXiv Detail & Related papers (2022-02-24T13:14:58Z)
- SensatUrban: Learning Semantics from Urban-Scale Photogrammetric Point Clouds [52.624157840253204]
We introduce SensatUrban, an urban-scale UAV photogrammetry point cloud dataset consisting of nearly three billion points collected from three UK cities, covering 7.6 km².
Each point in the dataset has been labelled with fine-grained semantic annotations, resulting in a dataset that is three times the size of the previously largest photogrammetric point cloud dataset.
arXiv Detail & Related papers (2022-01-12T14:48:11Z)
- Efficient Binary Embedding of Categorical Data using BinSketch [0.9560980936110233]
We propose a dimensionality reduction algorithm, aka sketching, for categorical datasets.
Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors.
Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
arXiv Detail & Related papers (2021-11-13T18:18:35Z)
- Statistical embedding: Beyond principal components [0.0]
Three methods are presented: $t$-SNE, UMAP and LargeVis based on methods in parts one, two and three, respectively.
The methods are illustrated and compared on two simulated data sets; one consisting of a triple of noisy Ranunculoid curves, and one consisting of networks of increasing complexity and with two types of nodes.
arXiv Detail & Related papers (2021-06-03T14:01:21Z)
- Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges [52.624157840253204]
We present an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points.
Our dataset consists of large areas from three UK cities, covering about 7.6 km² of the city landscape.
We evaluate the performance of state-of-the-art algorithms on our dataset and provide a comprehensive analysis of the results.
arXiv Detail & Related papers (2020-09-07T14:47:07Z)
- Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)