DeepSketch: A New Machine Learning-Based Reference Search Technique for
Post-Deduplication Delta Compression
- URL: http://arxiv.org/abs/2202.10584v1
- Date: Thu, 17 Feb 2022 16:00:22 GMT
- Title: DeepSketch: A New Machine Learning-Based Reference Search Technique for
Post-Deduplication Delta Compression
- Authors: Jisung Park, Jeonggyun Kim, Yeseong Kim, Sungjin Lee, Onur Mutlu
- Abstract summary: We propose DeepSketch, a new reference search technique for post-deduplication delta compression.
DeepSketch uses a deep neural network to extract a data block's sketch, i.e., to create an approximate data signature of the block.
Our evaluation shows that DeepSketch improves the data-reduction ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication delta-compression technique.
- Score: 20.311114684028375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data reduction in storage systems is becoming increasingly important as an
effective solution to minimize the management cost of a data center. To
maximize data-reduction efficiency, existing post-deduplication
delta-compression techniques perform delta compression along with traditional
data deduplication and lossless compression. Unfortunately, we observe that
existing techniques achieve significantly lower data-reduction ratios than the
optimal due to their limited accuracy in identifying similar data blocks.
In this paper, we propose DeepSketch, a new reference search technique for
post-deduplication delta compression that leverages the learning-to-hash method
to achieve higher accuracy in reference search for delta compression, thereby
improving data-reduction efficiency. DeepSketch uses a deep neural network to
extract a data block's sketch, i.e., to create an approximate data signature of
the block that can preserve similarity with other blocks. Our evaluation using
eleven real-world workloads shows that DeepSketch improves the data-reduction
ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication
delta-compression technique.
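To make the mechanism concrete, the following Python sketch (not the authors' code) shows the overall flow of sketch-based reference search followed by delta compression. Every name here (BLOCK_SIZE, SKETCH_BITS, SketchIndex, compute_sketch) is hypothetical: a sign-of-random-projection hash stands in for DeepSketch's trained learning-to-hash network, and XOR plus zlib stands in for a real delta encoder.

```python
# Illustrative sketch of sketch-based reference search for post-deduplication
# delta compression. The random-projection hash below is a stand-in for the
# learned, similarity-preserving signature described in the abstract.
import zlib
import numpy as np

BLOCK_SIZE = 4096   # assumed fixed block size
SKETCH_BITS = 128   # assumed signature width

def compute_sketch(block: bytes, projection: np.ndarray) -> np.ndarray:
    """Map a block to a compact binary signature that preserves similarity."""
    x = np.frombuffer(block.ljust(BLOCK_SIZE, b"\0")[:BLOCK_SIZE], dtype=np.uint8)
    return (projection @ x.astype(np.float32) > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

class SketchIndex:
    """Holds sketches of already-stored blocks and answers reference queries."""
    def __init__(self, projection: np.ndarray):
        self.projection = projection
        self.entries = []  # list of (sketch, block) pairs

    def insert(self, block: bytes) -> None:
        self.entries.append((compute_sketch(block, self.projection), block))

    def find_reference(self, block: bytes, max_dist: int = 16):
        """Return the most similar stored block, or None if nothing is close enough."""
        sketch = compute_sketch(block, self.projection)
        best = min(self.entries, key=lambda e: hamming(e[0], sketch), default=None)
        if best is None or hamming(best[0], sketch) > max_dist:
            return None
        return best[1]

def delta_compress(block: bytes, reference: bytes) -> bytes:
    """Toy delta encoding: XOR against the reference, then lossless compression."""
    ref = reference.ljust(len(block), b"\0")[:len(block)]
    return zlib.compress(bytes(a ^ b for a, b in zip(block, ref)))

# Usage: index stored blocks, then compress an incoming block against its reference.
rng = np.random.default_rng(0)
index = SketchIndex(rng.standard_normal((SKETCH_BITS, BLOCK_SIZE)).astype(np.float32))
index.insert(b"A" * BLOCK_SIZE)
incoming = b"A" * 4000 + b"B" * 96
ref = index.find_reference(incoming)
out = delta_compress(incoming, ref) if ref is not None else zlib.compress(incoming)
```

The accuracy of the sketch function determines how often a good reference is found; per the abstract above, replacing limited-accuracy signatures with a learned, similarity-preserving one is where DeepSketch's data-reduction gains come from.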
Related papers
- A Brief Review for Compression and Transfer Learning Techniques in DeepFake Detection [13.783950035836593]
Training and deploying deepfake detection models on edge devices offers the advantage of maintaining data privacy and confidentiality by processing data close to its source.
We explore compression techniques to reduce computational demands and inference time, alongside transfer learning methods to minimize training overhead.
arXiv Detail & Related papers (2025-04-29T13:37:21Z) - Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity [55.03958223190181]
We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity.
Our results are record-setting and are confirmed by experiments on different average losses and datasets.
arXiv Detail & Related papers (2024-12-21T00:40:58Z) - Variable Rate Neural Compression for Sparse Detector Data [9.331686712558144]
We propose a novel approach for TPC data compression via key-point identification facilitated by sparse convolution.
BCAE-VS achieves a 75% improvement in reconstruction accuracy with a 10% increase in compression ratio over the previous state-of-the-art model.
arXiv Detail & Related papers (2024-11-18T17:15:35Z) - ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks [51.03118447290247]
We propose the open-world deepfake detection network (ODDN), which comprises open-world data aggregation (ODA) and compression-discard gradient correction (CGC).
ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses.
CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in online social networks (OSNs).
arXiv Detail & Related papers (2024-10-24T12:32:22Z) - Sparse $L^1$-Autoencoders for Scientific Data Compression [0.0]
We introduce effective data compression methods by developing autoencoders using high-dimensional latent spaces that are $L^1$-regularized.
We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data.
arXiv Detail & Related papers (2024-05-23T07:48:00Z) - Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets [7.261516807130813]
Machine Learning and Artificial Intelligence (ML/AI) techniques have become increasingly prevalent in high-performance computing.
Data compression can be a solution to these problems, but an in-depth understanding of how lossy compression affects model quality is needed.
We show modern lossy compression methods can achieve a 50-100x compression ratio improvement for a 1% or less loss in quality.
arXiv Detail & Related papers (2024-03-23T23:14:37Z) - Compression of Structured Data with Autoencoders: Provable Benefit of
Nonlinearities and Depth [83.15263499262824]
We prove that gradient descent converges to a solution that completely disregards the sparse structure of the input.
We show how to improve upon Gaussian performance for the compression of sparse data by adding a denoising function to a shallow architecture.
We validate our findings on image datasets, such as CIFAR-10 and MNIST.
arXiv Detail & Related papers (2024-02-07T16:32:29Z) - Scalable Hybrid Learning Techniques for Scientific Data Compression [6.803722400888276]
Scientists require compression techniques that accurately preserve derived quantities of interest (QoIs).
This paper presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression.
arXiv Detail & Related papers (2022-12-21T03:00:18Z) - Unrolled Compressed Blind-Deconvolution [77.88847247301682]
Sparse multichannel blind deconvolution (S-MBD) arises frequently in many engineering applications such as radar/sonar/ultrasound imaging.
We propose a compression method that enables blind recovery from far fewer measurements than the full received signal in time.
arXiv Detail & Related papers (2022-09-28T15:16:58Z) - Efficient Data Compression for 3D Sparse TPC via Bicephalous
Convolutional Autoencoder [8.759778406741276]
This work introduces a dual-head autoencoder that resolves sparsity and regression simultaneously, called the Bicephalous Convolutional AutoEncoder (BCAE).
It shows advantages both in compression fidelity and ratio compared to traditional data compression methods, such as MGARD, SZ, and ZFP.
arXiv Detail & Related papers (2021-11-09T21:26:37Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve a 52.9% FLOPs reduction by removing 48.4% of the parameters of ResNet-50, with only a 0.56% Top-1 accuracy drop on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - An Efficient Statistical-based Gradient Compression Technique for
Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold-estimation quality to deep gradient compression (DGC).
Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively (a generic illustration of threshold-based sparsification appears after this list).
arXiv Detail & Related papers (2021-01-26T13:06:00Z) - PowerGossip: Practical Low-Rank Communication Compression in
Decentralized Deep Learning [62.440827696638664]
We introduce a simple algorithm that directly compresses the model differences between neighboring workers.
Inspired by the PowerSGD algorithm for centralized deep learning, this algorithm uses power iteration steps to maximize the information transferred per bit.
arXiv Detail & Related papers (2020-08-04T09:14:52Z)
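As a pointer back to the SIDCo entry above, here is a minimal, generic illustration of threshold-based gradient sparsification, the family of schemes that entry refers to. The percentile-based threshold and the function names (sparsify, densify) are assumptions made for clarity; SIDCo itself estimates the threshold by fitting a sparsity-inducing distribution to the gradient, which is not reproduced here.

```python
# Generic threshold-based gradient sparsification: keep only the largest-magnitude
# entries so a worker communicates a small (indices, values) pair instead of the
# full dense gradient. The percentile threshold is a simple stand-in estimator.
import numpy as np

def sparsify(gradient: np.ndarray, density: float = 0.01):
    """Keep roughly `density` of the entries (largest magnitudes); drop the rest."""
    flat = gradient.ravel()
    threshold = np.quantile(np.abs(flat), 1.0 - density)
    mask = np.abs(flat) >= threshold
    indices = np.nonzero(mask)[0]
    return indices, flat[indices]

def densify(indices: np.ndarray, values: np.ndarray, shape) -> np.ndarray:
    """Rebuild the sparsified gradient as a dense tensor on the receiving side."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)

# Example: communicate ~1% of a 1M-entry gradient instead of the full tensor.
grad = np.random.randn(1_000_000).astype(np.float32)
idx, vals = sparsify(grad, density=0.01)
recovered = densify(idx, vals, grad.shape)
```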