Dimensionality Reduction for Categorical Data
- URL: http://arxiv.org/abs/2112.00362v1
- Date: Wed, 1 Dec 2021 09:20:28 GMT
- Title: Dimensionality Reduction for Categorical Data
- Authors: Debajyoti Bera, Rameshwar Pratap, Bhisham Dev Verma
- Abstract summary: We present FSketch, which creates sketches for sparse categorical data, and an estimator that computes pairwise Hamming distances from the sketches alone.
FSketch is significantly faster, and the accuracy obtained by using its sketches is among the best for the standard unsupervised tasks of RMSE, clustering, and similarity search.
- Score: 0.9560980936110233
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Categorical attributes are those that can take a discrete set of values,
e.g., colours. This work is about compressing vectors over categorical
attributes to low-dimension discrete vectors. Existing hash-based methods for
this compression provide no guarantee on the Hamming distances between the
compressed representations. Here we present FSketch to create sketches for
sparse categorical data, together with an estimator that computes the pairwise
Hamming distances among the uncompressed data from their sketches alone. We claim that
these sketches can be used in the usual data mining tasks in place of the
original data without compromising the quality of the task. For that, we ensure
that the sketches are also categorical and sparse, and that the Hamming distance
estimates are reasonably precise. Both the sketch construction and the Hamming
distance estimation algorithms require just a single pass; furthermore, changes
to a data point can be incorporated into its sketch efficiently. The
compressibility depends upon how sparse the data is and is independent of the
original dimension -- making our algorithm attractive for many real-life
scenarios. Our claims are backed by rigorous theoretical analysis of the
properties of FSketch and supplemented by extensive comparative evaluations
with related algorithms on some real-world datasets. We show that FSketch is
significantly faster, and the accuracy obtained by using its sketches is among
the best for the standard unsupervised tasks of RMSE, clustering, and similarity
search.
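To make the construction concrete, here is a minimal Python sketch in the spirit of the abstract's description: each original coordinate is hashed to one of d' buckets, nonzero entries are combined with random multipliers modulo a prime p, and the Hamming distance is recovered by inverting the bucket-collision probability. The parameter names (d_prime, p) and the estimator formula are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def make_sketcher(d, d_prime, p, seed=0):
    """Draw the shared randomness: a map pi:[d]->[d'] and multipliers in {1..p-1}."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(0, d_prime, size=d)      # bucket for each original coordinate
    rho = rng.integers(1, p, size=d)           # nonzero multiplier per coordinate
    def sketch(x):
        s = np.zeros(d_prime, dtype=np.int64)
        for i in np.nonzero(x)[0]:             # single pass over nonzero entries
            s[pi[i]] = (s[pi[i]] + rho[i] * x[i]) % p
        return s
    return sketch

def estimate_hamming(s1, s2, d_prime, p):
    """Invert the bucket-collision probability to estimate the Hamming distance.
    Heuristic model: P(buckets agree) ~ (1-1/d')^h + (1-(1-1/d')^h)/p."""
    f = np.mean(s1 == s2)                      # fraction of agreeing sketch coords
    q = (f - 1.0 / p) / (1.0 - 1.0 / p)
    q = min(max(q, 1.0 / d_prime), 1.0)        # clamp away from log(0)
    return np.log(q) / np.log(1.0 - 1.0 / d_prime)

# toy usage: sparse categorical vectors with values in {0..p-1}
d, d_prime, p = 10_000, 500, 17
rng = np.random.default_rng(1)
x = np.zeros(d, dtype=np.int64)
idx = rng.choice(d, size=200, replace=False)
x[idx] = rng.integers(1, p, size=200)
y = x.copy()
y[idx[:40]] = (y[idx[:40]] + 1) % p            # exactly 40 differing coordinates
sk = make_sketcher(d, d_prime, p)
print("true Hamming:", np.sum(x != y),
      "estimate:", round(float(estimate_hamming(sk(x), sk(y), d_prime, p)), 1))
```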
Related papers
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
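The linearization idea is easy to demonstrate in one dimension, where the optimal transport map to a reference measure is given by quantile functions. The toy below is my own construction, not the paper's code: it embeds each empirical measure by its quantiles and then applies plain PCA, the role a Wassmap-style reduction plays on the linearized embeddings.

```python
import numpy as np

# Each "data point" is a 1-D distribution; here Gaussians N(mu, 1) whose means
# trace a 1-D curve, observed only through samples.
rng = np.random.default_rng(0)
mus = np.linspace(-3, 3, 40)
samples = [rng.normal(mu, 1.0, size=2000) for mu in mus]

# Linearized OT: in 1-D, evaluating each measure's quantile function at fixed
# levels yields a Euclidean embedding equivalent to the Monge map to a reference.
levels = (np.arange(200) + 0.5) / 200
embeddings = np.stack([np.quantile(s, levels) for s in samples])

# Standard PCA on the linearized embeddings recovers the 1-D structure.
E = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(E, full_matrices=False)
coords = U[:, 0] * S[0]
print("|correlation| with true parameter:", abs(np.corrcoef(coords, mus)[0, 1]))
```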
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Efficient Binary Embedding of Categorical Data using BinSketch [0.9560980936110233]
We propose a dimensionality reduction algorithm, also known as sketching, for categorical datasets.
Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors.
Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
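As a rough illustration of the flavor of such binary sketches (a hypothetical construction, not Cabin's exact one): hash each present (attribute, value) pair to a bucket and aggregate by OR.

```python
import numpy as np

def cabin_like_sketch(x, d_prime, n_values, seed=0):
    """Binary sketch of a categorical vector x (0 = absent, 1..n_values = category).
    Each present (attribute, value) pair is hashed to one bucket and aggregated
    by OR. This mirrors the BinSketch idea; it is not the paper's exact scheme."""
    rng = np.random.default_rng(seed)          # shared randomness across vectors
    buckets = rng.integers(0, d_prime, size=(len(x), n_values + 1))
    s = np.zeros(d_prime, dtype=np.uint8)
    for i in np.nonzero(x)[0]:
        s[buckets[i, x[i]]] = 1                # OR-aggregation of hashed bits
    return s

x = np.zeros(1000, dtype=np.int64)
x[[3, 17, 256]] = [2, 1, 5]
print(cabin_like_sketch(x, d_prime=64, n_values=8).sum(), "bits set")
```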
arXiv Detail & Related papers (2021-11-13T18:18:35Z) - Mean Nystr\"om Embeddings for Adaptive Compressive Learning [25.89586989444021]
We study the idea of performing sketching based on data-dependent Nyström approximation.
We show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nyström sketches indeed outperform those built with random features.
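A minimal numpy sketch of the data-dependent ingredient, assuming an RBF kernel and uniform landmark sampling: Nyström features are computed from landmarks drawn from the data, and the dataset is summarized by the empirical mean embedding.

```python
import numpy as np

def rbf(A, B, gamma):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom_mean_sketch(X, m, gamma, seed=0):
    """Nystrom features phi(x) = K(x, L) K(L, L)^{-1/2}; the dataset sketch is
    the empirical mean embedding mean_i phi(x_i), data-dependent via landmarks L."""
    rng = np.random.default_rng(seed)
    L = X[rng.choice(len(X), size=m, replace=False)]   # landmarks from the data
    Kll = rbf(L, L, gamma)
    w, V = np.linalg.eigh(Kll + 1e-10 * np.eye(m))
    Kll_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    Phi = rbf(X, L, gamma) @ Kll_inv_sqrt
    return Phi.mean(axis=0)

X = np.random.default_rng(1).normal(size=(5000, 10))
print(nystrom_mean_sketch(X, m=100, gamma=0.1).shape)  # (100,): compressed dataset
```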
arXiv Detail & Related papers (2021-10-21T09:05:58Z) - Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic
Uncertainty [58.144520501201995]
Bi-Lipschitz regularization of neural network layers preserves relative distances between data instances in the feature spaces of each layer.
With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices.
We also propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution.
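The efficiency of the diagonal-plus-low-rank parameterization is standard and easy to show: with Sigma = D + U U^T, the Mahalanobis energy can be evaluated via the Woodbury identity without ever forming Sigma. The sketch below is my own illustration of that identity, not the paper's model.

```python
import numpy as np

def mahalanobis_dplr(x, mu, d, U):
    """(x-mu)^T (D + U U^T)^{-1} (x-mu) via Woodbury: O(dim * r^2), not O(dim^3).
    d: positive diagonal of D (shape [dim]); U: low-rank factor (shape [dim, r])."""
    v = x - mu
    Dinv_v = v / d
    Dinv_U = U / d[:, None]
    # Woodbury: (D + U U^T)^{-1} = D^{-1} - D^{-1} U (I + U^T D^{-1} U)^{-1} U^T D^{-1}
    core = np.eye(U.shape[1]) + U.T @ Dinv_U
    correction = Dinv_U @ np.linalg.solve(core, U.T @ Dinv_v)
    return v @ (Dinv_v - correction)

dim, r = 512, 4
rng = np.random.default_rng(0)
x, mu = rng.normal(size=dim), np.zeros(dim)
d, U = np.ones(dim), 0.1 * rng.normal(size=(dim, r))
full = x @ np.linalg.solve(np.diag(d) + U @ U.T, x)      # dense reference
print(np.allclose(mahalanobis_dplr(x, mu, d, U), full))  # True
```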
arXiv Detail & Related papers (2021-10-12T22:04:19Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
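A compact sketch of the standard randomized recipe this line of work builds on (illustrative, not the paper's exact algorithm): sketch A down to a small matrix, factor it, and read approximate leverage scores off the row norms of A R^{-1}.

```python
import numpy as np

def approx_leverage_scores(A, oversample=10, seed=0):
    """Estimate row leverage scores of tall A (n x k, full column rank assumed).
    Exact scores are squared row norms of Q from A = QR; here R comes from a
    small randomized sketch S A instead of A itself."""
    n, k = A.shape
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(oversample * k, n)) / np.sqrt(oversample * k)
    _, R = np.linalg.qr(S @ A)                 # R approximates A's R factor
    B = np.linalg.solve(R.T, A.T).T            # B = A R^{-1}
    return (B**2).sum(axis=1)

A = np.random.default_rng(1).normal(size=(20000, 20))
exact = (np.linalg.qr(A)[0]**2).sum(axis=1)
approx = approx_leverage_scores(A)
print("max relative error:", np.max(np.abs(approx - exact) / exact))
```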
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Asymmetric compressive learning guarantees with applications to
quantized sketches [15.814495790111323]
We consider compressive learning, which allows, for example, training audio event classifiers on large-scale datasets from a compact sketch.
We study the relaxation where the feature map is allowed to differ between the sketching and learning phases.
We then instantiate this framework in the setting of quantized sketches, proving that the LPD indeed holds for binary sketch contributions.
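For intuition, a toy asymmetric sketch (my own construction, assuming random Fourier features): the dataset sketch is an average of per-sample feature contributions, and the quantized variant binarizes each contribution, so the map used for sketching differs from the ideal map assumed at learning time.

```python
import numpy as np

def sketch(X, W, b, quantize=False):
    """Average of per-sample random Fourier contributions cos(W x + b).
    With quantize=True each contribution is binarized (sign), giving the
    quantized sketch whose asymmetric guarantees the paper studies."""
    contrib = np.cos(X @ W.T + b)
    if quantize:
        contrib = np.sign(contrib)             # one-bit sketch contributions
    return contrib.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
W, b = rng.normal(size=(256, 5)), rng.uniform(0, 2 * np.pi, size=256)
s_full, s_quant = sketch(X, W, b), sketch(X, W, b, quantize=True)
print("sketch distortion:", np.linalg.norm(s_full - s_quant))
```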
arXiv Detail & Related papers (2021-04-20T15:37:59Z) - Learning Optical Flow from a Few Matches [67.83633948984954]
We show that the dense correlation volume representation is redundant and accurate flow estimation can be achieved with only a fraction of elements in it.
Experiments show that our method can reduce computational cost and memory use significantly, while maintaining high accuracy.
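The core idea can be shown in a few lines (hypothetical shapes, not the paper's code): keep only each source feature's top-k correlations instead of the dense all-pairs volume.

```python
import numpy as np

def topk_correlations(f1, f2, k):
    """f1, f2: feature maps flattened to (N, C). Instead of storing the dense
    (N, N) correlation volume, keep each source pixel's k best matches."""
    corr = f1 @ f2.T                           # dense here for clarity only; a
    idx = np.argpartition(-corr, k, axis=1)[:, :k]   # real implementation would
    vals = np.take_along_axis(corr, idx, axis=1)     # compute this blockwise
    return idx, vals

rng = np.random.default_rng(0)
N, C, k = 32 * 32, 32, 8
f1, f2 = rng.normal(size=(N, C)), rng.normal(size=(N, C))
idx, vals = topk_correlations(f1, f2, k)
print("dense entries:", N * N, "kept:", idx.size)    # 1,048,576 vs 8,192
```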
arXiv Detail & Related papers (2021-04-05T21:44:00Z) - Projected Hamming Dissimilarity for Bit-Level Importance Coding in
Collaborative Filtering [21.563733343861713]
We present a new way of measuring the dissimilarity between two objects in the Hamming space, with binary weighting of each dimension.
We propose a variational hashing model for learning hash codes optimized for this projected Hamming dissimilarity.
The resultant hash codes lead to effectiveness gains of up to +7% in NDCG and +14% in MRR.
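For intuition, here is a generic binary-weighted Hamming computation; it illustrates bit-level importance coding but is not necessarily the paper's exact projected Hamming definition.

```python
import numpy as np

def weighted_hamming(x, y, w):
    """Dissimilarity sum_j w_j * [x_j != y_j]; with w in {0,1} the comparison is
    restricted (projected) to the dimensions one object marks as important.
    Illustrative only; see the paper for the exact projected Hamming definition."""
    return np.sum(w * (x != y))

x = np.array([1, 0, 1, 1, 0, 1], dtype=np.uint8)
y = np.array([1, 1, 0, 1, 0, 0], dtype=np.uint8)
print(weighted_hamming(x, y, np.ones(6)))      # plain Hamming distance: 3
print(weighted_hamming(x, y, x))               # only x's active bits count: 2
```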
arXiv Detail & Related papers (2021-03-26T13:22:31Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
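One concrete instantiation of this measure (a sketch under my own simplifications, using a ridge-regression probe): report the smallest training-set size at which a predictor on top of the representation reaches a target held-out loss.

```python
import numpy as np

def samples_to_reach_loss(Z, y, target_mse, sizes, lam=1e-3, seed=0):
    """Smallest training-set size at which a ridge probe on representation Z
    reaches target_mse on held-out data; smaller means a better representation
    under the 'complexity of learning a low-loss predictor' view."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    test, pool = idx[:1000], idx[1000:]
    for n in sizes:
        tr = pool[:n]
        A = Z[tr].T @ Z[tr] + lam * np.eye(Z.shape[1])   # ridge normal equations
        w = np.linalg.solve(A, Z[tr].T @ y[tr])
        mse = np.mean((Z[test] @ w - y[test])**2)
        if mse <= target_mse:
            return n
    return None

rng = np.random.default_rng(1)
X = rng.normal(size=(6000, 50))
y = X[:, 0] + 0.1 * rng.normal(size=6000)
good, bad = X, rng.normal(size=(6000, 50))     # informative vs. random features
sizes = [20, 50, 100, 200, 500, 1000]
print(samples_to_reach_loss(good, y, 0.05, sizes),
      samples_to_reach_loss(bad, y, 0.05, sizes))
```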
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Learning the Positions in CountSketch [51.15935547615698]
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem.
In this work, we propose the first learning algorithm that also optimizes the locations of the non-zero entries.
We show this algorithm gives better accuracy for low-rank approximation than previous work, and apply it to other problems such as $k$-means clustering for the first time.
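For reference, here is a plain CountSketch with randomly drawn positions, the baseline whose non-zero locations the paper proposes to learn.

```python
import numpy as np

def countsketch_matrix(m, n, seed=0):
    """Classic CountSketch: one nonzero (+/-1) per column, at a random row.
    The paper's contribution is to *learn* these row positions (and values)
    instead of drawing them at random."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)          # position of each column's nonzero
    signs = rng.choice([-1.0, 1.0], size=n)
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S

A = np.random.default_rng(1).normal(size=(10000, 50))
S = countsketch_matrix(200, 10000)
SA = S @ A                                     # compressed problem, 200 x 50
print(SA.shape, "norm ratio:",
      np.linalg.norm(S @ A[:, 0]) / np.linalg.norm(A[:, 0]))
```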
arXiv Detail & Related papers (2020-07-20T05:06:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.