Related papers: Sketch and Scale: Geo-distributed tSNE and UMAP


URL: http://arxiv.org/abs/2011.06103v1
Date: Wed, 11 Nov 2020 22:32:21 GMT
Title: Sketch and Scale: Geo-distributed tSNE and UMAP
Authors: Viska Wei, Nikita Ivkin, Vladimir Braverman, Alexander Szalay
Abstract summary: Running machine learning analytics over geographically distributed datasets is a rapidly arising problem. We introduce a novel framework: Sketch and Scale (SnS). It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced-size sketches on the master node, and runs vanilla tSNE or UMAP on the summary. We show this technique to be fully parallel and to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe.
Score: 75.44887265789056
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Running machine learning analytics over geographically distributed datasets is a rapidly arising problem in the world of data management policies ensuring privacy and data security. Visualizing high-dimensional data using tools such as t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP) has become common practice for data scientists. Both tools scale poorly in time and memory. While recent optimizations showed successful handling of 10,000 data points, scaling beyond a million points is still challenging. We introduce a novel framework: Sketch and Scale (SnS). It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced-size sketches on the master node, and runs vanilla tSNE or UMAP on the summary, which represents the densest areas extracted from the aggregated sketch. We show this technique to be fully parallel and to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe. We demonstrate the power of our method on two mid-size datasets: cancer data with 52 million 35-band pixels from multiple images of tumor biopsies; and astrophysics data of 100 million stars with multi-color photometry from the Sloan Digital Sky Survey (SDSS).
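The sketch-then-aggregate step described in the abstract can be illustrated with a minimal Count Sketch in Python. This is only a sketch under stated assumptions, not the authors' implementation: the class `CountSketch`, the grid quantization in `to_cell`, the `depth` and `width` values, and the toy data are all hypothetical. It shows the one property the pipeline relies on: sketches are linear, so tables built independently on edge nodes can be summed on the master node. The final step (running tSNE or UMAP on the densest cells) is omitted.

```python
import hashlib
from statistics import median


class CountSketch:
    """Illustrative Count Sketch with `depth` rows and `width` buckets per row."""

    def __init__(self, depth=5, width=256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, key, row):
        # One deterministic hash per (row, key) yields a bucket and a +/-1 sign.
        digest = hashlib.blake2b(repr((row, key)).encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        sign = 1 if value & 1 else -1
        return (value >> 1) % self.width, sign

    def add(self, key, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            self.table[row][bucket] += sign * count

    def estimate(self, key):
        # Median of the signed counters across rows estimates the key's count.
        values = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(key, row)
            values.append(sign * self.table[row][bucket])
        return median(values)

    def merge(self, other):
        # Linearity: merging two sketches is element-wise addition of tables.
        assert (self.depth, self.width) == (other.depth, other.width)
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]
        return self


def to_cell(point, cell_size=1.0):
    # Assumed quantization: snap a point to an integer grid cell.
    return tuple(int(x // cell_size) for x in point)


# Toy data held on two hypothetical edge nodes.
node_a = [(0.1, 0.2)] * 50 + [(5.5, 5.1)] * 3
node_b = [(0.4, 0.9)] * 40 + [(9.0, 1.0)] * 2

# Each edge node compresses its own points into a fixed-size sketch.
edge_sketches = []
for points in (node_a, node_b):
    sketch = CountSketch()
    for p in points:
        sketch.add(to_cell(p))
    edge_sketches.append(sketch)

# The master node only ever sees the sketches, never the raw points.
merged = CountSketch()
for sketch in edge_sketches:
    merged.merge(sketch)

# Simplification: candidate cells are enumerated directly here; a streaming
# system would track heavy hitters (for example with a heap) instead.
candidates = {to_cell(p) for p in node_a + node_b}
dense_cells = sorted(candidates, key=merged.estimate, reverse=True)[:2]
print(dense_cells)
```

The table size depends only on `depth` and `width`, not on the number of points, which is why the amount of data sent from each edge node stays small as the dataset grows.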

Related papers

EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision [72.84868704100595]
This paper presents a dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Accompanying the dataset is EarthMAE, a tailored Masked Autoencoder developed to tackle the distinct challenges of remote sensing data.
arXiv Detail & Related papers (2025-01-14T13:42:22Z)
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting [64.7364925689825]
Argoverse 2 (AV2) is a collection of three datasets for perception and forecasting research in the self-driving domain. The Lidar dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. The Motion Forecasting dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene.
arXiv Detail & Related papers (2023-01-02T00:36:22Z)
Study of Manifold Geometry using Multiscale Non-Negative Kernel Graphs [32.40622753355266]
We propose a framework to study the geometric structure of the data. We make use of our recently introduced non-negative kernel (NNK) regression graphs to estimate the point density, intrinsic dimension, and the linearity of the data manifold (curvature).
arXiv Detail & Related papers (2022-10-31T17:01:17Z)
PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking? [62.997667081978825]
We encode 3D detections as nodes in a graph, where spatial and temporal pairwise relations among objects are encoded via localized polar coordinates on graph edges. This allows our graph neural network to learn to effectively encode temporal and spatial interactions. We establish a new state-of-the-art on nuScenes dataset and, more importantly, show that our method, PolarMOT, generalizes remarkably well across different locations.
arXiv Detail & Related papers (2022-08-03T10:06:56Z)
SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP [3.7731754155538164]
This work introduces a force-directed approach to multidimensional scaling with a time and space complexity of O(N) for N data points. The method can be combined with force-directed layouts from the neighbour embedding family, such as t-SNE, to produce embeddings that preserve both the global and the local structures of the data.
arXiv Detail & Related papers (2022-02-24T13:14:58Z)
SensatUrban: Learning Semantics from Urban-Scale Photogrammetric Point Clouds [52.624157840253204]
We introduce SensatUrban, an urban-scale UAV photogrammetry point cloud dataset consisting of nearly three billion points collected from three UK cities, covering 7.6 km². Each point in the dataset has been labelled with fine-grained semantic annotations, resulting in a dataset that is three times the size of the largest previously existing photogrammetric point cloud dataset.
arXiv Detail & Related papers (2022-01-12T14:48:11Z)
Efficient Binary Embedding of Categorical Data using BinSketch [0.9560980936110233]
We propose a dimensionality reduction algorithm, aka sketching, for categorical datasets. Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors. Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
arXiv Detail & Related papers (2021-11-13T18:18:35Z)
Statistical embedding: Beyond principal components [0.0]
Three methods are presented: $t$-SNE, UMAP and LargeVis based on methods in parts one, two and three, respectively. The methods are illustrated and compared on two simulated data sets; one consisting of a triple of noisy Ranunculoid curves, and one consisting of networks of increasing complexity and with two types of nodes.
arXiv Detail & Related papers (2021-06-03T14:01:21Z)
Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges [52.624157840253204]
We present an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points. Our dataset consists of large areas from three UK cities, covering about 7.6 km² of the city landscape. We evaluate the performance of state-of-the-art algorithms on our dataset and provide a comprehensive analysis of the results.
arXiv Detail & Related papers (2020-09-07T14:47:07Z)
Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research. OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains. For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.