Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data
- URL: http://arxiv.org/abs/2410.05078v2
- Date: Fri, 23 May 2025 10:55:43 GMT
- Title: Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data
- Authors: David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein
- Abstract summary: We conduct a large-scale study to find a sweet spot where pre-trained transformers can achieve competitive compression ratios. We find that relatively small models can outperform standard general-purpose compression algorithms. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
- Score: 8.475091996107741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC) – even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model- and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
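A minimal sketch of the underlying setup, assuming the standard pairing of a predictive model with an arithmetic coder: the ideal code length of a byte stream is its negative log2-likelihood under the model, and the reported ratio is adjusted by charging for the model's parameters. The `next_byte_probs` interface and the 2-bytes-per-parameter assumption are placeholders for illustration, not details taken from the paper.

```python
import numpy as np

def compressed_size_bits(data: bytes, next_byte_probs) -> float:
    """Idealized arithmetic-coding cost: -sum(log2 p(byte | preceding bytes)).
    `next_byte_probs(context)` stands in for a pre-trained byte-level transformer
    (hypothetical interface) and returns a length-256 probability vector."""
    bits = 0.0
    for i, b in enumerate(data):
        p = next_byte_probs(data[:i])
        bits += -np.log2(p[b])
    return bits

def adjusted_compression_ratio(raw_size: int, code_bits: float, n_params: int,
                               bytes_per_param: int = 2) -> float:
    """Compression ratio that also charges for shipping the model (fp16 assumed)."""
    return (code_bits / 8 + n_params * bytes_per_param) / raw_size

# Toy stand-in model: uniform over bytes. A real transformer concentrates mass
# on likely continuations, which is what pushes the ratio below 1.
uniform = lambda ctx: np.full(256, 1.0 / 256)
payload = b"example byte sequence " * 50
bits = compressed_size_bits(payload, uniform)
print(adjusted_compression_ratio(len(payload), bits, n_params=0))  # -> 1.0
```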
Related papers
- Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression [53.08742231761896]
UltraDelta is a data-free delta compression pipeline that achieves both ultra-high compression and strong performance.
UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions.
arXiv Detail & Related papers (2025-05-19T10:37:22Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
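A minimal sketch of the generic building block behind low-rank weight compression, a truncated SVD of a projection matrix; the rank, matrix shapes, and the paper's progressive per-layer schedule are illustrative assumptions rather than its exact recipe.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Truncated-SVD factorization W ~= A @ B; the generic step behind
    low-rank compression of KV projection weights (sketch only)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # shape (d_in, rank)
    B = Vt[:rank, :]             # shape (rank, d_out)
    return A, B

# Example: compress a hypothetical key-projection matrix to rank 64.
W_k = np.random.randn(1024, 1024).astype(np.float32)
A, B = low_rank_factorize(W_k, rank=64)
print(np.linalg.norm(W_k - A @ B) / np.linalg.norm(W_k))  # relative approximation error
```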
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs)
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
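A heavily simplified illustration of what arithmetic over model parameters can look like, adding a scaled parameter offset to a base model; how the paper actually constructs and trains its compensator offsets is not captured here, so treat this as a generic sketch.

```python
import numpy as np

def apply_model_arithmetic(base: dict, delta: dict, alpha: float = 1.0) -> dict:
    """Shift every parameter tensor of `base` by a scaled offset from `delta`.
    Purely illustrative of composing parameter offsets across settings."""
    return {name: w + alpha * delta[name] for name, w in base.items()}

# Hypothetical offset bridging two token-compression degrees.
base = {"layer0.weight": np.zeros((4, 4))}
delta = {"layer0.weight": 0.01 * np.ones((4, 4))}
adapted = apply_model_arithmetic(base, delta, alpha=1.0)
```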
arXiv Detail & Related papers (2024-08-13T10:36:43Z) - What Operations can be Performed Directly on Compressed Arrays, and with What Error? [1.3307486544794784]
We develop a lossy compressor that allows a dozen fairly fundamental operations directly on compressed data.
We evaluate it on three non-trivial applications, choosing different number systems for internal representation.
arXiv Detail & Related papers (2024-06-17T05:01:09Z) - Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask [19.612260423937744]
Existing data reduction techniques are not specifically designed for pre-trained model (PTM) dataset files.
This paper presents the first exhaustive analysis to date of the storage compressibility of PTM datasets.
We develop Elves, a compression framework that integrates ELF along with several other data reduction methods.
arXiv Detail & Related papers (2024-02-20T23:45:37Z) - A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
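A minimal sketch of TopK sparsification, the compressor referenced above, which keeps only the largest-magnitude entries of a gradient or activation tensor; the exact variants and rates studied in the paper are not reproduced.

```python
import numpy as np

def topk_compress(x: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude entries (TopK sparsification)."""
    flat = x.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], x.shape

def topk_decompress(idx, vals, shape):
    """Scatter the retained values back into a dense zero tensor."""
    out = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    out[idx] = vals
    return out.reshape(shape)

grad = np.random.randn(256, 256)
idx, vals, shape = topk_compress(grad, ratio=0.05)
approx = topk_decompress(idx, vals, shape)
```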
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - Data-Aware Gradient Compression for FL in Communication-Constrained Mobile Computing [20.70238092277094]
Federated Learning (FL) in mobile environments faces significant communication bottlenecks.
A one-size-fits-all compression approach does not account for the varying data volumes across workers.
We propose assigning varying compression ratios to workers with distinct data distributions and volumes.
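A toy illustration of the idea, assuming a simple linear rule that gives workers with more local data a higher retention ratio; the paper's actual assignment policy is not reproduced here.

```python
def assign_compression_ratios(data_volumes, r_min=0.01, r_max=0.2):
    """Hypothetical data-aware rule: workers with more local data keep a larger
    fraction of their gradient (i.e., are compressed less aggressively)."""
    lo, hi = min(data_volumes), max(data_volumes)
    span = (hi - lo) or 1
    return [r_min + (v - lo) / span * (r_max - r_min) for v in data_volumes]

print(assign_compression_ratios([100, 1000, 5000]))
```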
arXiv Detail & Related papers (2023-11-13T13:24:09Z) - Lossy and Lossless (L$^2$) Post-training Model Size Compression [12.926354646945397]
We propose a post-training model size compression method that combines lossy and lossless compression in a unified way.
Our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time.
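A hedged sketch of a generic lossy-then-lossless pipeline (uniform quantization followed by a general-purpose entropy coder); the step size and the choice of zlib are placeholders, not the paper's tuned components.

```python
import zlib
import numpy as np

def lossy_lossless_compress(weights: np.ndarray, step: float = 1e-2) -> bytes:
    """Lossy stage: uniform quantization; lossless stage: zlib entropy coding."""
    q = np.round(weights / step).astype(np.int16)
    return zlib.compress(q.tobytes(), level=9)

def lossy_lossless_decompress(blob: bytes, shape, step: float = 1e-2) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
    return q.astype(np.float32) * step

w = np.random.randn(1000).astype(np.float32)
blob = lossy_lossless_compress(w)
print(len(blob) / w.nbytes)  # achieved compression ratio on this toy tensor
```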
arXiv Detail & Related papers (2023-08-08T14:10:16Z) - GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training [0.0]
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model.
GraVAC is a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing information loss associated with compression.
As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively.
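A toy adaptive rule in the same spirit: estimate how much gradient information TopK discards at the current factor and loosen or tighten it accordingly. The thresholds and the exact information-loss metric are assumptions, not GraVAC's.

```python
import numpy as np

def adapt_compression_factor(grad, current_k, loss_threshold=0.1,
                             k_min=0.001, k_max=0.1):
    """Double the retained fraction when too much L1 mass is discarded,
    halve it otherwise (illustrative adaptive policy)."""
    flat = np.abs(grad.ravel())
    k = max(1, int(current_k * flat.size))
    kept = np.sort(flat)[-k:]
    info_loss = 1.0 - kept.sum() / flat.sum()   # fraction of L1 mass discarded
    if info_loss > loss_threshold:
        return min(current_k * 2, k_max)        # compress less aggressively
    return max(current_k / 2, k_min)            # compress more aggressively

g = np.random.randn(10000)
print(adapt_compression_factor(g, current_k=0.01))
```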
arXiv Detail & Related papers (2023-05-20T14:25:17Z) - Compressing Transformer-based self-supervised models for speech processing [45.254624876127124]
We study several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation.
We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations.
Our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches.
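A minimal sketch of one of the listed techniques, unstructured magnitude pruning; head pruning, low-rank approximation, and distillation would be separate components and are not shown.

```python
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

W = np.random.randn(768, 768)
print((magnitude_prune(W, 0.5) == 0).mean())  # ~0.5 of the weights are pruned
```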
arXiv Detail & Related papers (2022-11-17T23:53:52Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
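A simplified sketch of a compression-aware update: evaluate the gradient at a pruned copy of the weights and apply it to the dense weights, so the dense model remains accurate after pruning. This illustrates the idea rather than reproducing the paper's full algorithm.

```python
import numpy as np

def topk_project(w, sparsity=0.5):
    """Magnitude pruning used as the compression operator C(w)."""
    t = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= t, w, 0.0)

def compression_aware_step(w, grad_fn, lr=0.1, sparsity=0.5):
    """Gradient is taken at the compressed weights C(w), update applied to dense w."""
    g = grad_fn(topk_project(w, sparsity))
    return w - lr * g

# Toy quadratic objective to show the mechanics.
target = np.random.randn(100)
grad_fn = lambda w: w - target
w = np.random.randn(100)
for _ in range(50):
    w = compression_aware_step(w, grad_fn, lr=0.2)
```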
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - Exploring Autoencoder-based Error-bounded Compression for Scientific Data [14.724393511470225]
We develop an error-bounded autoencoder-based framework in terms of the SZ model.
We optimize the compression quality for the main stages in our designed AE-based error-bounded compression framework.
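A generic sketch of error-bounded lossy compression, with the autoencoder abstracted behind hypothetical `encode`/`decode` functions: reconstructions that violate the bound are patched with exact stored values, so the pointwise bound always holds. The FFT stand-in below is only for self-containedness and is not the paper's autoencoder design.

```python
import numpy as np

def error_bounded_compress(x, encode, decode, abs_err=1e-2):
    """Lossy predictor plus exact corrections wherever the pointwise error exceeds the bound."""
    code = encode(x)
    recon = decode(code)
    bad = np.abs(recon - x) > abs_err
    corrections = (np.nonzero(bad)[0], x[bad])   # store exact values for outliers
    return code, corrections

def error_bounded_decompress(code, corrections, decode):
    recon = decode(code)
    idx, vals = corrections
    recon[idx] = vals
    return recon

# Stand-in "autoencoder": keep only the 32 lowest-frequency rFFT coefficients.
encode = lambda x: np.fft.rfft(x)[:32]
decode = lambda c: np.fft.irfft(np.pad(c, (0, 481)), n=1024)
x = np.sin(np.linspace(0, 20, 1024)) + 0.01 * np.random.randn(1024)
code, corr = error_bounded_compress(x, encode, decode, abs_err=0.05)
x_hat = error_bounded_decompress(code, corr, decode)
assert np.max(np.abs(x_hat - x)) <= 0.05  # the error bound is guaranteed
```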
arXiv Detail & Related papers (2021-05-25T07:53:32Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
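A rough sketch of combining the two compression axes named above on a single weight matrix: channel pruning followed by a low-rank factorization standing in for tensor decomposition. The pruning ratio and rank are illustrative, not the paper's learned allocation.

```python
import numpy as np

def collaborative_compress(W: np.ndarray, keep_channels: float = 0.5, rank: int = 16):
    """(1) Drop output channels with the smallest L1 norm (channel pruning),
    (2) low-rank factorize the remaining matrix (stand-in for tensor decomposition)."""
    norms = np.abs(W).sum(axis=1)
    keep = np.argsort(norms)[-int(keep_channels * W.shape[0]):]
    W_pruned = W[keep]                                   # channel pruning
    U, S, Vt = np.linalg.svd(W_pruned, full_matrices=False)
    A, B = U[:, :rank] * S[:rank], Vt[:rank]             # low-rank decomposition
    return keep, A, B

W = np.random.randn(256, 512)
keep, A, B = collaborative_compress(W)
print(A.size + B.size, W.size)  # parameter counts after vs. before
```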
This list is automatically generated from the titles and abstracts of the papers on this site.