Efficient Binary Embedding of Categorical Data using BinSketch
- URL: http://arxiv.org/abs/2111.07163v1
- Date: Sat, 13 Nov 2021 18:18:35 GMT
- Title: Efficient Binary Embedding of Categorical Data using BinSketch
- Authors: Bhisham Dev Verma and Rameshwar Pratap and Debajyoti Bera
- Abstract summary: We propose a dimensionality reduction algorithm, also known as sketching, for categorical datasets.
Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors.
Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
- Score: 0.9560980936110233
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we present a dimensionality reduction algorithm, also
known as sketching, for categorical datasets. Our proposed sketching algorithm Cabin
constructs low-dimensional binary sketches from high-dimensional categorical
vectors, and our distance estimation algorithm Cham computes a close
approximation of the Hamming distance between any two original vectors only
from their sketches. The minimum dimension of the sketches required by Cham to
ensure a good estimation theoretically depends only on the sparsity of the data
points, making it useful for many real-life scenarios involving sparse
datasets. We present a rigorous theoretical analysis of our approach and
supplement it with extensive experiments on several high-dimensional real-world
data sets, including one with over a million dimensions. We show that the Cabin
and Cham duo is a significantly fast and accurate approach for tasks such as
RMSE, all-pairs similarity, and clustering when compared to working with the
full dataset and other dimensionality reduction techniques.
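The Cabin/Cham idea of compressing sparse binary vectors into short binary sketches that still support Hamming-distance comparison can be illustrated with a minimal sketch. This is an illustrative simplification, not the paper's exact construction: each coordinate is randomly mapped to one sketch bit and OR-aggregated, and the raw sketch Hamming distance stands in for Cham's bias-corrected estimator. All dimensions and parameters below are assumed for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 512          # original and sketch dimensions (illustrative)

# Random map from each original coordinate to one of k sketch bits.
bucket = rng.integers(0, k, size=d)

def sketch(x):
    """OR-aggregate the bits of x that land in each sketch bucket."""
    s = np.zeros(k, dtype=bool)
    np.logical_or.at(s, bucket, x.astype(bool))
    return s

# Two sparse binary vectors differing in exactly 30 coordinates.
x = rng.random(d) < 0.005
y = x.copy()
flip = rng.choice(d, size=30, replace=False)
y[flip] = ~y[flip]

true_ham = int(np.sum(x != y))          # 30 by construction
sketch_ham = int(np.sum(sketch(x) != sketch(y)))
print(true_ham, sketch_ham)
```

Because OR-aggregation can only merge differing coordinates into the same bucket, the sketch distance never exceeds the true one, and for sparse vectors collisions are rare, so the two values stay close; the actual Cham estimator corrects for the remaining collision bias.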
Related papers
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z)
- Dimensionality Reduction for Categorical Data [0.9560980936110233]
We present FSketch to create sketches for sparse categorical data and an estimator to estimate the pairwise Hamming distances.
FSketch is significantly faster, and the accuracy obtained by using its sketches is among the best for the standard unsupervised tasks of RMSE, clustering, and similarity search.
arXiv Detail & Related papers (2021-12-01T09:20:28Z)
- Index $t$-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings [1.7188280334580195]
This paper presents a methodology to reuse an embedding to create a new one, where cluster positions are preserved.
The proposed algorithm has the same complexity as the original $t$-SNE to embed new items, and a lower one when considering the embedding of a dataset sliced into sub-pieces.
arXiv Detail & Related papers (2021-09-22T06:45:37Z)
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering over eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis testing.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
- Sketch and Scale: Geo-distributed tSNE and UMAP [75.44887265789056]
Running machine learning analytics over geographically distributed datasets is an increasingly common problem.
We introduce a novel framework: Sketch and Scale (SnS).
It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced size sketches on the master node, and runs vanilla tSNE or UMAP on the summary.
We show this technique to be fully parallel and to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe.
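A minimal single-machine illustration of the Count Sketch structure that SnS builds on (the distributed edge/master aggregation and the tSNE/UMAP steps are omitted; table width, row count, and hash choices here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, w, rows = 10_000, 256, 5                   # input dim, table width, hash rows

h = rng.integers(0, w, size=(rows, d))        # bucket hash, one per row
s = rng.choice([-1.0, 1.0], size=(rows, d))   # sign hash, one per row

def cs_compress(x):
    """Count Sketch: each row accumulates signed coordinate values into buckets."""
    table = np.zeros((rows, w))
    for r in range(rows):
        np.add.at(table[r], h[r], s[r] * x)
    return table

def cs_query(table, i):
    """Median-of-rows estimate of coordinate x[i]."""
    return np.median([s[r, i] * table[r, h[r, i]] for r in range(rows)])

x = np.zeros(d)
x[7] = 5.0                                    # a single heavy coordinate
table = cs_compress(x)
print(cs_query(table, 7))                     # recovers the heavy coordinate
```

Because sketches of this form are linear, tables built independently on edge nodes can simply be summed at the master node before the final embedding step, which is the property the SnS framework exploits.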
arXiv Detail & Related papers (2020-11-11T22:32:21Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Dimensionality Reduction via Diffusion Map Improved with Supervised Linear Projection [1.7513645771137178]
In this paper, we assume the data samples lie on a single underlying smooth manifold.
We define intra-class and inter-class similarities using pairwise local kernel distances.
We aim to find a linear projection to maximize the intra-class similarities and minimize the inter-class similarities simultaneously.
arXiv Detail & Related papers (2020-08-08T04:26:07Z)
- Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
- Scalable Distributed Approximation of Internal Measures for Clustering Evaluation [5.144809478361603]
A widely used internal measure for clustering evaluation is the silhouette coefficient, whose exact computation requires a quadratic number of distance calculations.
We present the first scalable algorithm to compute a rigorous approximation of the silhouette coefficient for clusterings based on any metric distance.
We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation.
arXiv Detail & Related papers (2020-03-03T10:28:14Z)
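The quadratic cost being avoided, and the flavor of a sampling-based approximation, can be sketched as follows. This naive uniform-sampling estimator is a simplification for illustration only, not the paper's rigorous approximation algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def silhouette_sampled(X, labels, sample_size=50):
    """Approximate mean silhouette using a uniform sample of reference points
    instead of all-pairs distances (illustrative, unbounded-error estimator)."""
    n = len(X)
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    ref_X, ref_lab = X[idx], labels[idx]
    scores = []
    for i in range(n):
        dists = np.linalg.norm(ref_X - X[i], axis=1)
        same = ref_lab == labels[i]
        if same.sum() == 0 or (~same).sum() == 0:
            continue  # sample missed a cluster; skip this point
        a = dists[same].mean()                       # mean dist to own cluster
        b = min(dists[ref_lab == c].mean()           # nearest other cluster
                for c in set(ref_lab) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated blobs: mean silhouette should be close to 1.
X = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(5, 0.1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
print(silhouette_sampled(X, labels))
```

Each point touches only `sample_size` reference points, so the cost drops from O(n^2) to O(n * sample_size); the contribution of the paper above is doing this distributedly with provable accuracy guarantees, which this plain uniform sampler does not provide.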
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.