Related papers: Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding

URL: http://arxiv.org/abs/2412.00369v1
Date: Sat, 30 Nov 2024 06:24:34 GMT
Title: Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding
Authors: Daniel Severo, Ashish Khisti, Alireza Makhzani
Abstract summary: We present an optimal method for encoding cluster assignments of arbitrary data sets. Our method, Random Cycle Coding (RCC), encodes data sequentially and sends assignment information as cycles of the permutation defined by the order of encoded elements.
Score: 20.815837902767072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present an optimal method for encoding cluster assignments of arbitrary data sets. Our method, Random Cycle Coding (RCC), encodes data sequentially and sends assignment information as cycles of the permutation defined by the order of encoded elements. RCC does not require any training and its worst-case complexity scales quasi-linearly with the size of the largest cluster. We characterize the achievable bit rates as a function of cluster sizes and number of elements, showing RCC consistently outperforms previous methods while requiring less compute and memory resources. Experiments show RCC can save up to 2 bytes per element when applied to vector databases, and removes the need for assigning integer ids to identify vectors, translating to savings of up to 70% in vector database systems for similarity search applications.
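
The abstract's central idea, that cluster assignments can be carried by the cycles of a permutation, can be illustrated with a minimal sketch. This is not the RCC codec: it omits bits-back coding and any entropy coder entirely, and the function names `clusters_to_permutation` and `permutation_to_clusters` are hypothetical, chosen only for this illustration. It shows only that a partition of elements can be written as a permutation whose cycles are the clusters, and read back from it without loss.

```python
from typing import Dict, List


def clusters_to_permutation(clusters: List[List[int]]) -> Dict[int, int]:
    """Build a permutation whose cycles are exactly the given clusters.

    Hypothetical helper for illustration: each element maps to the next
    element of its own cluster, and the last element wraps to the first.
    """
    perm: Dict[int, int] = {}
    for cluster in clusters:
        for i, element in enumerate(cluster):
            perm[element] = cluster[(i + 1) % len(cluster)]
    return perm


def permutation_to_clusters(perm: Dict[int, int]) -> List[List[int]]:
    """Recover the clusters by walking every cycle of the permutation."""
    seen = set()
    clusters: List[List[int]] = []
    for start in sorted(perm):
        if start in seen:
            continue
        cycle = []
        current = start
        # Follow the cycle until it returns to an already visited element.
        while current not in seen:
            seen.add(current)
            cycle.append(current)
            current = perm[current]
        clusters.append(sorted(cycle))
    return clusters


# Assumed toy data: six elements split into three clusters.
example_clusters = [[0, 3, 5], [1], [2, 4]]
example_perm = clusters_to_permutation(example_clusters)
recovered = permutation_to_clusters(example_perm)
```

Many different permutations share the same cycle structure over the same clusters. The sketch picks one arbitrarily and does nothing with that freedom, whereas the abstract describes the method as a bits-back scheme.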

Related papers

Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
Efficient Ranking, Order Statistics, and Sorting under CKKS [5.543544712471747]
Fully Homomorphic Encryption (FHE) enables operations on encrypted data, making it extremely useful for privacy-preserving applications. The high computational overhead and limited native operations of FHE pose significant challenges for an efficient implementation of these tasks. We present solutions for ranking, order statistics, and sorting that achieve a comparison depth of up to 2 (constant).
arXiv Detail & Related papers (2024-12-19T18:06:25Z)
Hierarchical Clustering using Reversible Binary Cellular Automata for High-Dimensional Data [0.0]
In cellular automaton (CA) based clustering, if two objects belong to the same cycle, they are closely related and considered part of the same cluster. This paper identifies the relationship between objects in two different cycles based on the median of all elements in each cycle so that they can be grouped in the next stage. When verified over standard benchmark datasets with various performance metrics, our algorithm is on par with the existing algorithms with quadratic time complexity.
arXiv Detail & Related papers (2024-08-05T05:48:45Z)
Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest [0.0]
The problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. In this work, we extended the Symbolic Pattern Forest algorithm to determine the optimal number of clusters for time series datasets. We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.
arXiv Detail & Related papers (2023-10-01T23:33:37Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries [3.162643581562756]
This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). The solution lies in transforming sequences into an alternative representation that facilitates easier clustering into similar groups compared to the raw sequences themselves.
arXiv Detail & Related papers (2023-08-08T17:31:17Z)
Factorizers for Distributed Sparse Block Codes [45.29870215671697]
We propose a fast and highly accurate method for factorizing distributed sparse block codes (SBCs). Our iterative factorizer introduces a threshold-based nonlinear activation, conditional random sampling, and an $\ell_\infty$-based similarity metric. We demonstrate the feasibility of our method on four deep CNN architectures over CIFAR-100, ImageNet-1K, and RAVEN datasets.
arXiv Detail & Related papers (2023-03-24T12:31:48Z)
Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection [59.77647907277523]
Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks. ACL needs tremendous running time to generate the adversarial variants of all training data. This paper proposes a robustness-aware coreset selection (RCS) method to speed up ACL.
arXiv Detail & Related papers (2023-02-08T03:20:14Z)
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation. We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25 times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
Very Compact Clusters with Structural Regularization via Similarity and Connectivity [3.779514860341336]
We propose an end-to-end deep clustering algorithm, i.e., Very Compact Clusters (VCC), for general datasets. Our proposed approach achieves better clustering performance than most state-of-the-art clustering methods.
arXiv Detail & Related papers (2021-06-09T23:22:03Z)
Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
Gradient Coding with Dynamic Clustering for Straggler Mitigation [57.9123881133818]
GC-DC regulates the number of straggling workers in each cluster based on the straggler behavior in the previous iteration. We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
arXiv Detail & Related papers (2020-11-03T18:52:15Z)
Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
We study the problem of clustering a set of items from binary user feedback. We devise an algorithm with a minimal cluster recovery error rate. For adaptive selection, we develop an algorithm inspired by the derivation of the information-theoretical error lower bounds.
arXiv Detail & Related papers (2019-10-14T09:18:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.