Generalized compression and compressive search of large datasets
- URL: http://arxiv.org/abs/2409.12161v1
- Date: Wed, 18 Sep 2024 17:25:31 GMT
- Title: Generalized compression and compressive search of large datasets
- Authors: Morgan E. Prior, Thomas Howard III, Emily Light, Najib Ishaq, Noah M. Daniels
- Abstract summary: panCAKES is a novel approach to compressive search, i.e., a way to perform $k$-NN and $\rho$-NN search on compressed data.
panCAKES assumes the manifold hypothesis and leverages the low-dimensional structure of the data to compress and search it efficiently.
We benchmark panCAKES on a variety of datasets, including genomic, proteomic, and set data.
- Score: 0.0
- Abstract: The Big Data explosion has necessitated the development of search algorithms that scale sub-linearly in time and memory. While compression algorithms and search algorithms do exist independently, few algorithms offer both, and those which do are domain-specific. We present panCAKES, a novel approach to compressive search, i.e., a way to perform $k$-NN and $\rho$-NN search on compressed data while only decompressing a small, relevant portion of the data. panCAKES assumes the manifold hypothesis and leverages the low-dimensional structure of the data to compress and search it efficiently. panCAKES is generic over any distance function for which the distance between two points is proportional to the memory cost of storing an encoding of one in terms of the other. This property holds for many widely-used distance functions, e.g., string edit distances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures (Jaccard, Dice, etc.). We benchmark panCAKES on a variety of datasets, including genomic, proteomic, and set data. We compare compression ratios to gzip, and search performance between the compressed and uncompressed versions of the same dataset. panCAKES achieves compression ratios close to those of gzip, while offering sub-linear time performance for $k$-NN and $\rho$-NN search. We conclude that panCAKES is an efficient, general-purpose algorithm for exact compressive search on large datasets that obey the manifold hypothesis. We provide an open-source implementation of panCAKES in the Rust programming language.
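The cost-proportionality property above is easiest to see for string edit distances: a point can be stored as an edit script against a nearby point, and the script carries one costly operation per unit of Levenshtein distance. The Rust sketch below illustrates only this property; it is not the panCAKES implementation, and names such as `EditOp` and `edit_script` are hypothetical.

```rust
// A minimal, illustrative sketch (not the panCAKES codebase) of the property
// the abstract describes: under Levenshtein distance, a point can be stored
// as an edit script against a nearby point, and the script's size grows in
// proportion to the distance between them.

#[derive(Debug, Clone, Copy, PartialEq)]
enum EditOp {
    Keep,    // copy the next reference byte unchanged (cheap to encode)
    Sub(u8), // substitute the next reference byte (costly)
    Ins(u8), // insert a new byte (costly)
    Del,     // delete the next reference byte (costly)
}

/// Fill the standard Levenshtein DP table, then backtrace to recover an
/// edit script that turns `reference` into `target`.
fn edit_script(reference: &[u8], target: &[u8]) -> Vec<EditOp> {
    let (n, m) = (reference.len(), target.len());
    let mut dp = vec![vec![0usize; m + 1]; n + 1];
    for i in 0..=n { dp[i][0] = i; }
    for j in 0..=m { dp[0][j] = j; }
    for i in 1..=n {
        for j in 1..=m {
            let cost = if reference[i - 1] == target[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j - 1] + cost)
                .min(dp[i - 1][j] + 1)  // delete reference[i-1]
                .min(dp[i][j - 1] + 1); // insert target[j-1]
        }
    }
    // Backtrace from the bottom-right corner to recover the operations.
    let (mut i, mut j) = (n, m);
    let mut ops = Vec::new();
    while i > 0 || j > 0 {
        if i > 0 && j > 0 && reference[i - 1] == target[j - 1] && dp[i][j] == dp[i - 1][j - 1] {
            ops.push(EditOp::Keep); i -= 1; j -= 1;
        } else if i > 0 && j > 0 && dp[i][j] == dp[i - 1][j - 1] + 1 {
            ops.push(EditOp::Sub(target[j - 1])); i -= 1; j -= 1;
        } else if i > 0 && dp[i][j] == dp[i - 1][j] + 1 {
            ops.push(EditOp::Del); i -= 1;
        } else {
            ops.push(EditOp::Ins(target[j - 1])); j -= 1;
        }
    }
    ops.reverse();
    ops
}

fn main() {
    let reference = b"GATTACA";
    let target = b"GCATGCA";
    let script = edit_script(reference, target);
    // The costly (non-Keep) ops number exactly the Levenshtein distance, so
    // storing `target` as a script against `reference` costs memory
    // proportional to their distance.
    let non_keep = script.iter().filter(|op| **op != EditOp::Keep).count();
    println!("script = {:?}", script);
    println!("edit distance = {}", non_keep);
}
```

Running this on two short strings prints an edit script whose non-`Keep` operations number exactly the Levenshtein distance, which is why storing a point relative to a near neighbor is cheap.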
Related papers
- Lightweight Correlation-Aware Table Compression [58.50312417249682]
$\texttt{Virtual}$ is a framework that integrates seamlessly with existing open formats.
Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
arXiv Detail & Related papers (2024-10-17T22:28:07Z)
- Mixed-Precision Embeddings for Large-Scale Recommendation Models [19.93156309493436]
Mixed-Precision Embeddings (MPE) is a novel embedding compression method.
MPE achieves about 200x compression on the Criteo dataset without compromising the prediction accuracy.
arXiv Detail & Related papers (2024-09-30T14:04:27Z)
- Sparse $L^1$-Autoencoders for Scientific Data Compression [0.0]
We introduce effective data compression methods by developing autoencoders with high-dimensional latent spaces that are $L^1$-regularized.
We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data.
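For concreteness, a standard $L^1$-regularized autoencoder objective has the following form (our notation; the paper's exact formulation may differ):

```latex
% Reconstruction error plus an L1 sparsity penalty on the latent code;
% E_theta is the encoder, D_phi the decoder, lambda the sparsity weight.
\min_{\theta,\phi}\; \mathbb{E}_{x}\Big[\, \lVert x - D_\phi(E_\theta(x)) \rVert_2^2 \;+\; \lambda\, \lVert E_\theta(x) \rVert_1 \,\Big]
```

The $L^1$ term drives most latent coordinates to zero, so the high-dimensional code is sparse and therefore cheap to store.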
arXiv Detail & Related papers (2024-05-23T07:48:00Z)
- Compression of Structured Data with Autoencoders: Provable Benefit of Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input.
We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture.
We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z)
- PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search [8.15681999722805]
This paper studies density-based clustering of point sets.
It unifies the different variants of density peaks clustering into a single framework, PECANN.
We implement five clustering algorithms with PECANN and evaluate them on synthetic and real-world datasets with up to 1.28 million points and up to 1024 dimensions on a 30-core machine with two-way hyper-threading.
arXiv Detail & Related papers (2023-12-06T22:43:50Z)
- Communication-Efficient Adam-Type Algorithms for Distributed Data Mining [93.50424502011626]
We propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching.
Our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration.
arXiv Detail & Related papers (2022-10-14T01:42:05Z)
- COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z)
- Using Convolutional Neural Networks to Detect Compression Algorithms [0.0]
We take a base dataset, compress every file with various algorithms, and train a model on the result.
The resulting model accurately identifies files compressed using compress, lzip, and bzip2.
arXiv Detail & Related papers (2021-11-17T11:03:16Z)
- Partition and Code: learning how to compress graphs [50.29024357495154]
"Partition and Code" framework entails three steps: first, a partitioning algorithm decomposes the graph into elementary structures, then these are mapped to the elements of a small dictionary on which we learn a probability distribution, and finally, an entropy encoder translates the representation into bits.
Our algorithms are quantitatively evaluated on diverse real-world networks obtaining significant performance improvements with respect to different families of non-parametric and parametric graph compressor.
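A back-of-the-envelope view of the resulting code length (our notation, not the paper's): if the partition cuts $G$ into substructures $s_1, \dots, s_m$ drawn from a dictionary with learned probabilities $p(s_i)$, the entropy coder needs roughly

```latex
% Total description length: bits for the partition itself,
% plus the entropy-coded dictionary substructures.
L(G) \;\approx\; L(\mathrm{partition}) \;+\; \sum_{i=1}^{m} -\log_2 p(s_i) \quad \text{bits,}
```

so learning a distribution that concentrates mass on frequent substructures directly shortens the code.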
arXiv Detail & Related papers (2021-07-05T11:41:16Z)
- Sampling from a $k$-DPP without looking at all items [58.30573872035083]
Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP).
Existing $k$-DPP sampling algorithms require an expensive preprocessing step which involves multiple passes over all $n$ items, making it infeasible for large datasets.
We develop an algorithm which adaptively builds a sufficiently large uniform sample of data that is then used to efficiently generate a smaller set of $k$ items.
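For reference, the distribution being sampled is the standard $k$-DPP (notation ours): for a kernel matrix $K$ and a subset $S$ with $|S| = k$,

```latex
% Probability of a size-k subset S is proportional to det(K_S),
% the determinant of the kernel submatrix indexed by S.
\Pr(S) \;=\; \frac{\det(K_S)}{\sum_{|S'| = k} \det(K_{S'})}
```

The normalizer sums over all $\binom{n}{k}$ size-$k$ subsets, which is why avoiding full passes over all $n$ items is non-trivial.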
arXiv Detail & Related papers (2020-06-30T16:40:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.