Dimensionality Reduction for Categorical Data
- URL: http://arxiv.org/abs/2112.00362v1
- Date: Wed, 1 Dec 2021 09:20:28 GMT
- Title: Dimensionality Reduction for Categorical Data
- Authors: Debajyoti Bera, Rameshwar Pratap, Bhisham Dev Verma
- Abstract summary: We present FSketch, which creates sketches for sparse categorical data, and an estimator that computes pairwise Hamming distances from the sketches alone.
FSketch is significantly faster, and the accuracy obtained by using its sketches is among the best for the standard unsupervised tasks of RMSE, clustering, and similarity search.
- Score: 0.9560980936110233
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Categorical attributes are those that can take a discrete set of values,
e.g., colours. This work is about compressing vectors over categorical
attributes to low-dimension discrete vectors. Existing hash-based methods for
this compression provide no guarantee on the Hamming distances between the
compressed representations. Here we present FSketch to create sketches for
sparse categorical data, together with an estimator that computes the pairwise
Hamming distances among the uncompressed data from their sketches alone. We claim that
these sketches can be used in the usual data mining tasks in place of the
original data without compromising the quality of the task. For that, we ensure
that the sketches are also categorical and sparse, and that the Hamming distance
estimates are reasonably precise. Both the sketch construction and the Hamming
distance estimation algorithms require just a single pass; furthermore, changes
to a data point can be incorporated into its sketch efficiently. The
compressibility depends upon how sparse the data is and is independent of the
original dimension -- making our algorithm attractive for many real-life
scenarios. Our claims are backed by rigorous theoretical analysis of the
properties of FSketch and supplemented by extensive comparative evaluations
with related algorithms on some real-world datasets. We show that FSketch is
significantly faster, and the accuracy obtained by using its sketches is among
the best for the standard unsupervised tasks of RMSE, clustering, and similarity
search.
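To make the construction concrete, here is a minimal Python sketch in the spirit of the abstract's description: each original coordinate is hashed to one of d' buckets, nonzero entries are combined with random multipliers modulo a prime p, and the Hamming distance is recovered by inverting the bucket-collision probability. The parameter names (d_prime, p) and the estimator formula are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def make_sketcher(d, d_prime, p, seed=0):
    """Draw the shared randomness: a map pi:[d]->[d'] and multipliers in {1..p-1}."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(0, d_prime, size=d)      # bucket for each original coordinate
    rho = rng.integers(1, p, size=d)           # nonzero multiplier per coordinate
    def sketch(x):
        s = np.zeros(d_prime, dtype=np.int64)
        for i in np.nonzero(x)[0]:             # single pass over nonzero entries
            s[pi[i]] = (s[pi[i]] + rho[i] * x[i]) % p
        return s
    return sketch

def estimate_hamming(s1, s2, d_prime, p):
    """Invert the bucket-collision probability to estimate the Hamming distance.
    Heuristic model: P(buckets agree) ~ (1-1/d')^h + (1-(1-1/d')^h)/p."""
    f = np.mean(s1 == s2)                      # fraction of agreeing sketch coords
    q = (f - 1.0 / p) / (1.0 - 1.0 / p)
    q = min(max(q, 1.0 / d_prime), 1.0)        # clamp away from log(0)
    return np.log(q) / np.log(1.0 - 1.0 / d_prime)

# toy usage: sparse categorical vectors with values in {0..p-1}
d, d_prime, p = 10_000, 500, 17
rng = np.random.default_rng(1)
x = np.zeros(d, dtype=np.int64)
idx = rng.choice(d, size=200, replace=False)
x[idx] = rng.integers(1, p, size=200)
y = x.copy()
y[idx[:40]] = (y[idx[:40]] + 1) % p            # exactly 40 differing coordinates
sk = make_sketcher(d, d_prime, p)
print("true Hamming:", np.sum(x != y),
      "estimate:", round(float(estimate_hamming(sk(x), sk(y), d_prime, p)), 1))
```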
Related papers
- Linearized Wasserstein dimensionality reduction with approximation guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
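The linearization idea is easy to demonstrate in one dimension, where the optimal transport map to a reference measure is given by quantile functions. The toy below is my own construction, not the paper's code: it embeds each empirical measure by its quantiles and then applies plain PCA, the role a Wassmap-style reduction plays on the linearized embeddings.

```python
import numpy as np

# Each "data point" is a 1-D distribution; here Gaussians N(mu, 1) whose means
# trace a 1-D curve, observed only through samples.
rng = np.random.default_rng(0)
mus = np.linspace(-3, 3, 40)
samples = [rng.normal(mu, 1.0, size=2000) for mu in mus]

# Linearized OT: in 1-D, evaluating each measure's quantile function at fixed
# levels yields a Euclidean embedding equivalent to the Monge map to a reference.
levels = (np.arange(200) + 0.5) / 200
embeddings = np.stack([np.quantile(s, levels) for s in samples])

# Standard PCA on the linearized embeddings recovers the 1-D structure.
E = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(E, full_matrices=False)
coords = U[:, 0] * S[0]
print("|correlation| with true parameter:", abs(np.corrcoef(coords, mus)[0, 1]))
```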
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Efficient Binary Embedding of Categorical Data using BinSketch [0.9560980936110233]
We propose a dimensionality reduction algorithm, also known as sketching, for categorical datasets.
Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors.
Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
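As a rough illustration of the flavor of such binary sketches (a hypothetical construction, not Cabin's exact one): hash each present (attribute, value) pair to a bucket and aggregate by OR.

```python
import numpy as np

def cabin_like_sketch(x, d_prime, n_values, seed=0):
    """Binary sketch of a categorical vector x (0 = absent, 1..n_values = category).
    Each present (attribute, value) pair is hashed to one bucket and aggregated
    by OR. This mirrors the BinSketch idea; it is not the paper's exact scheme."""
    rng = np.random.default_rng(seed)          # shared randomness across vectors
    buckets = rng.integers(0, d_prime, size=(len(x), n_values + 1))
    s = np.zeros(d_prime, dtype=np.uint8)
    for i in np.nonzero(x)[0]:
        s[buckets[i, x[i]]] = 1                # OR-aggregation of hashed bits
    return s

x = np.zeros(1000, dtype=np.int64)
x[[3, 17, 256]] = [2, 1, 5]
print(cabin_like_sketch(x, d_prime=64, n_values=8).sum(), "bits set")
```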
arXiv Detail & Related papers (2021-11-13T18:18:35Z) - Mean Nystr\"om Embeddings for Adaptive Compressive Learning [25.89586989444021]
We study the idea of performing sketching based on data-dependent Nyström approximation.
We show for k-means clustering and Gaussian modeling that for a fixed sketch size, Nyström sketches indeed outperform those built with random features.
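A minimal numpy sketch of the data-dependent ingredient, assuming an RBF kernel and uniform landmark sampling: Nyström features are computed from landmarks drawn from the data, and the dataset is summarized by the empirical mean embedding.

```python
import numpy as np

def rbf(A, B, gamma):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom_mean_sketch(X, m, gamma, seed=0):
    """Nystrom features phi(x) = K(x, L) K(L, L)^{-1/2}; the dataset sketch is
    the empirical mean embedding mean_i phi(x_i), data-dependent via landmarks L."""
    rng = np.random.default_rng(seed)
    L = X[rng.choice(len(X), size=m, replace=False)]   # landmarks from the data
    Kll = rbf(L, L, gamma)
    w, V = np.linalg.eigh(Kll + 1e-10 * np.eye(m))
    Kll_inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    Phi = rbf(X, L, gamma) @ Kll_inv_sqrt
    return Phi.mean(axis=0)

X = np.random.default_rng(1).normal(size=(5000, 10))
print(nystrom_mean_sketch(X, m=100, gamma=0.1).shape)  # (100,): compressed dataset
```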
arXiv Detail & Related papers (2021-10-21T09:05:58Z) - Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic
Uncertainty [58.144520501201995]
Bi-Lipschitz regularization of neural network layers preserves relative distances between data instances in the feature spaces of each layer.
With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices.
We also propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution.
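The efficiency of the diagonal-plus-low-rank parameterization is standard and easy to show: with Sigma = D + U U^T, the Mahalanobis energy can be evaluated via the Woodbury identity without ever forming Sigma. The sketch below is my own illustration of that identity, not the paper's model.

```python
import numpy as np

def mahalanobis_dplr(x, mu, d, U):
    """(x-mu)^T (D + U U^T)^{-1} (x-mu) via Woodbury: O(dim * r^2), not O(dim^3).
    d: positive diagonal of D (shape [dim]); U: low-rank factor (shape [dim, r])."""
    v = x - mu
    Dinv_v = v / d
    Dinv_U = U / d[:, None]
    # Woodbury: (D + U U^T)^{-1} = D^{-1} - D^{-1} U (I + U^T D^{-1} U)^{-1} U^T D^{-1}
    core = np.eye(U.shape[1]) + U.T @ Dinv_U
    correction = Dinv_U @ np.linalg.solve(core, U.T @ Dinv_v)
    return v @ (Dinv_v - correction)

dim, r = 512, 4
rng = np.random.default_rng(0)
x, mu = rng.normal(size=dim), np.zeros(dim)
d, U = np.ones(dim), 0.1 * rng.normal(size=(dim, r))
full = x @ np.linalg.solve(np.diag(d) + U @ U.T, x)      # dense reference
print(np.allclose(mahalanobis_dplr(x, mu, d, U), full))  # True
```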
arXiv Detail & Related papers (2021-10-12T22:04:19Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
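A compact sketch of the standard randomized recipe this line of work builds on (illustrative, not the paper's exact algorithm): sketch A down to a small matrix, factor it, and read approximate leverage scores off the row norms of A R^{-1}.

```python
import numpy as np

def approx_leverage_scores(A, oversample=10, seed=0):
    """Estimate row leverage scores of tall A (n x k, full column rank assumed).
    Exact scores are squared row norms of Q from A = QR; here R comes from a
    small randomized sketch S A instead of A itself."""
    n, k = A.shape
    rng = np.random.default_rng(seed)
    S = rng.normal(size=(oversample * k, n)) / np.sqrt(oversample * k)
    _, R = np.linalg.qr(S @ A)                 # R approximates A's R factor
    B = np.linalg.solve(R.T, A.T).T            # B = A R^{-1}
    return (B**2).sum(axis=1)

A = np.random.default_rng(1).normal(size=(20000, 20))
exact = (np.linalg.qr(A)[0]**2).sum(axis=1)
approx = approx_leverage_scores(A)
print("max relative error:", np.max(np.abs(approx - exact) / exact))
```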
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Asymmetric compressive learning guarantees with applications to
quantized sketches [15.814495790111323]
We consider compressive learning, which allows, for example, training audio event classifiers on large-scale datasets from a compact sketch.
We study the relaxation where the feature map is allowed to differ between the sketching and learning phases.
We then instantiate this framework in the setting of quantized sketches, proving that the LPD indeed holds for binary sketch contributions.
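For intuition, a toy asymmetric sketch (my own construction, assuming random Fourier features): the dataset sketch is an average of per-sample feature contributions, and the quantized variant binarizes each contribution, so the map used for sketching differs from the ideal map assumed at learning time.

```python
import numpy as np

def sketch(X, W, b, quantize=False):
    """Average of per-sample random Fourier contributions cos(W x + b).
    With quantize=True each contribution is binarized (sign), giving the
    quantized sketch whose asymmetric guarantees the paper studies."""
    contrib = np.cos(X @ W.T + b)
    if quantize:
        contrib = np.sign(contrib)             # one-bit sketch contributions
    return contrib.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
W, b = rng.normal(size=(256, 5)), rng.uniform(0, 2 * np.pi, size=256)
s_full, s_quant = sketch(X, W, b), sketch(X, W, b, quantize=True)
print("sketch distortion:", np.linalg.norm(s_full - s_quant))
```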
arXiv Detail & Related papers (2021-04-20T15:37:59Z) - Learning Optical Flow from a Few Matches [67.83633948984954]
We show that the dense correlation volume representation is redundant and accurate flow estimation can be achieved with only a fraction of elements in it.
Experiments show that our method can reduce computational cost and memory use significantly, while maintaining high accuracy.
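The core idea can be shown in a few lines (hypothetical shapes, not the paper's code): keep only each source feature's top-k correlations instead of the dense all-pairs volume.

```python
import numpy as np

def topk_correlations(f1, f2, k):
    """f1, f2: feature maps flattened to (N, C). Instead of storing the dense
    (N, N) correlation volume, keep each source pixel's k best matches."""
    corr = f1 @ f2.T                           # dense here for clarity only; a
    idx = np.argpartition(-corr, k, axis=1)[:, :k]   # real implementation would
    vals = np.take_along_axis(corr, idx, axis=1)     # compute this blockwise
    return idx, vals

rng = np.random.default_rng(0)
N, C, k = 32 * 32, 32, 8
f1, f2 = rng.normal(size=(N, C)), rng.normal(size=(N, C))
idx, vals = topk_correlations(f1, f2, k)
print("dense entries:", N * N, "kept:", idx.size)    # 1,048,576 vs 8,192
```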
arXiv Detail & Related papers (2021-04-05T21:44:00Z) - Projected Hamming Dissimilarity for Bit-Level Importance Coding in
Collaborative Filtering [21.563733343861713]
We present a new way of measuring the dissimilarity between two objects in the Hamming space, with binary weighting of each dimension.
We propose a variational hashing model for learning hash codes optimized for this projected Hamming dissimilarity.
The resultant hash codes lead to effectiveness gains of up to +7% in NDCG and +14% in MRR.
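For intuition, here is a generic binary-weighted Hamming computation; it illustrates bit-level importance coding but is not necessarily the paper's exact projected Hamming definition.

```python
import numpy as np

def weighted_hamming(x, y, w):
    """Dissimilarity sum_j w_j * [x_j != y_j]; with w in {0,1} the comparison is
    restricted (projected) to the dimensions one object marks as important.
    Illustrative only; see the paper for the exact projected Hamming definition."""
    return np.sum(w * (x != y))

x = np.array([1, 0, 1, 1, 0, 1], dtype=np.uint8)
y = np.array([1, 1, 0, 1, 0, 0], dtype=np.uint8)
print(weighted_hamming(x, y, np.ones(6)))      # plain Hamming distance: 3
print(weighted_hamming(x, y, x))               # only x's active bits count: 2
```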
arXiv Detail & Related papers (2021-03-26T13:22:31Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
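One concrete instantiation of this measure (a sketch under my own simplifications, using a ridge-regression probe): report the smallest training-set size at which a predictor on top of the representation reaches a target held-out loss.

```python
import numpy as np

def samples_to_reach_loss(Z, y, target_mse, sizes, lam=1e-3, seed=0):
    """Smallest training-set size at which a ridge probe on representation Z
    reaches target_mse on held-out data; smaller means a better representation
    under the 'complexity of learning a low-loss predictor' view."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    test, pool = idx[:1000], idx[1000:]
    for n in sizes:
        tr = pool[:n]
        A = Z[tr].T @ Z[tr] + lam * np.eye(Z.shape[1])   # ridge normal equations
        w = np.linalg.solve(A, Z[tr].T @ y[tr])
        mse = np.mean((Z[test] @ w - y[test])**2)
        if mse <= target_mse:
            return n
    return None

rng = np.random.default_rng(1)
X = rng.normal(size=(6000, 50))
y = X[:, 0] + 0.1 * rng.normal(size=6000)
good, bad = X, rng.normal(size=(6000, 50))     # informative vs. random features
sizes = [20, 50, 100, 200, 500, 1000]
print(samples_to_reach_loss(good, y, 0.05, sizes),
      samples_to_reach_loss(bad, y, 0.05, sizes))
```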
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Learning the Positions in CountSketch [51.15935547615698]
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem.
In this work, we propose the first learning algorithm that also optimizes the locations of the non-zero entries.
We show this algorithm gives better accuracy for low-rank approximation than previous work, and apply it to other problems such as $k$-means clustering for the first time.
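For reference, here is a plain CountSketch with randomly drawn positions, the baseline whose non-zero locations the paper proposes to learn.

```python
import numpy as np

def countsketch_matrix(m, n, seed=0):
    """Classic CountSketch: one nonzero (+/-1) per column, at a random row.
    The paper's contribution is to *learn* these row positions (and values)
    instead of drawing them at random."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)          # position of each column's nonzero
    signs = rng.choice([-1.0, 1.0], size=n)
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S

A = np.random.default_rng(1).normal(size=(10000, 50))
S = countsketch_matrix(200, 10000)
SA = S @ A                                     # compressed problem, 200 x 50
print(SA.shape, "norm ratio:",
      np.linalg.norm(S @ A[:, 0]) / np.linalg.norm(A[:, 0]))
```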
arXiv Detail & Related papers (2020-07-20T05:06:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.