Compressing Transformer-based self-supervised models for speech
processing
- URL: http://arxiv.org/abs/2211.09949v2
- Date: Sat, 27 Jan 2024 03:40:26 GMT
- Title: Compressing Transformer-based self-supervised models for speech
processing
- Authors: Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen,
Tzu-hsun Feng, Hung-yi Lee, Hao Tang
- Abstract summary: We study several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation.
We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations.
Our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches.
- Score: 45.254624876127124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of Transformers in self-supervised learning with
applications to various downstream tasks, the computational cost of training
and inference remains a major challenge for applying these models to a wide
spectrum of devices. Several isolated attempts have been made to compress
Transformers, but the settings and metrics are different across studies.
Trade-offs at various compression rates are also largely missing in prior work,
making it difficult to compare compression techniques. In this work, we aim to
provide context for the isolated results, studying several commonly used
compression techniques, including weight pruning, head pruning, low-rank
approximation, and knowledge distillation. We report trade-offs at various
compression rates, including wall-clock time, the number of parameters, and the
number of multiply-accumulate operations. Our results show that compared to
recent approaches, basic compression techniques are strong baselines. We
further present several applications of our results, revealing properties of
Transformers, such as the significance of diagonal attention heads. In
addition, our results lead to a simple combination of compression techniques
that improves the trade-off over recent approaches. We hope the results will
promote more diverse comparisons among model compression techniques and promote
the use of model compression as a tool for analyzing models. Our code for
compressing speech self-supervised models is available at
https://github.com/nervjack2/Speech-SSL-Compression/.
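To make the studied techniques concrete, the sketch below illustrates two of the four methods the abstract names, magnitude-based weight pruning and low-rank approximation of a weight matrix, together with the parameter counts the paper uses as one of its cost metrics. It is a minimal NumPy illustration under assumed layer dimensions, not the authors' implementation; their code is in the repository linked above.
```python
# A minimal sketch (plain NumPy) of two of the compression techniques named in
# the abstract: magnitude-based weight pruning and low-rank approximation of a
# weight matrix. Layer shapes are assumptions for illustration; this is not the
# authors' implementation (see the repository linked above for their code).
import numpy as np


def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so that `sparsity` of them are zero."""
    threshold = np.quantile(np.abs(weight), sparsity)
    return np.where(np.abs(weight) >= threshold, weight, 0.0)


def low_rank_approx(weight: np.ndarray, rank: int) -> tuple[np.ndarray, np.ndarray]:
    """Factor W (d_out x d_in) into A (d_out x r) @ B (r x d_in) via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((768, 3072))  # hypothetical feed-forward weight matrix

    w_pruned = magnitude_prune(w, sparsity=0.8)
    a, b = low_rank_approx(w, rank=128)

    # Parameter counts, one of the cost metrics the paper reports alongside
    # wall-clock time and multiply-accumulate operations.
    print("dense params:   ", w.size)
    print("pruned nonzeros:", np.count_nonzero(w_pruned))
    print("low-rank params:", a.size + b.size)
```
Head pruning removes entire attention heads and knowledge distillation trains a smaller student to mimic a larger teacher; both operate at a coarser granularity than the per-weight operations sketched here.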
Related papers
- Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data [8.475091996107741]
This paper investigates whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible.
We train families of models on 165GB of raw byte sequences of either text, image, or audio data.
We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms.
arXiv Detail & Related papers (2024-10-07T14:32:03Z)
- Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments [20.360936113552597]
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output.
Existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools.
To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare.
Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models' predictions, weights, and activations.
arXiv Detail & Related papers (2024-08-06T16:17:51Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer models.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits (a minimal sketch of quantization appears after this list).
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
- Lossy and Lossless (L$^2$) Post-training Model Size Compression [12.926354646945397]
We propose a post-training model size compression method that combines lossy and lossless compression in a unified way.
Our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time.
arXiv Detail & Related papers (2023-08-08T14:10:16Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method that has several appealing properties prior works do not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Differentiable Microscopy for Content and Task Aware Compressive Fluorescence Imaging [0.0]
The trade-off between throughput and image quality is an inherent challenge in microscopy.
Deep learning-based methods have achieved greater success in compression and image quality.
We propose differentiable compressive fluorescence microscopy.
arXiv Detail & Related papers (2022-03-28T17:53:10Z)
- Analyzing and Mitigating JPEG Compression Defects in Deep Learning [69.04777875711646]
We present a unified study of the effects of JPEG compression on a range of common tasks and datasets.
We show that there is a significant penalty on common performance metrics for high compression.
arXiv Detail & Related papers (2020-11-17T20:32:57Z)
- Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)
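The entry on "The Cost of Compression" above describes quantization as representing model parameters with fewer bits. The snippet below is a minimal sketch of symmetric per-tensor uniform quantization, an illustration in plain NumPy under assumed shapes and bit width rather than any listed paper's implementation; production toolkits typically add per-channel scales, calibration data, and activation quantization.
```python
# A minimal sketch of symmetric per-tensor uniform quantization: map float
# weights onto a signed 8-bit integer grid with a single scale factor, then map
# back. Shapes and bit width are assumptions for illustration, not taken from
# any of the papers above.
import numpy as np


def quantize(weight: np.ndarray, num_bits: int = 8) -> tuple[np.ndarray, float]:
    """Round weights onto a signed integer grid defined by one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.max(np.abs(weight))), 1e-12) / qmax  # avoid division by zero
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer representation."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    q, scale = quantize(w)
    print("max abs error:", float(np.max(np.abs(w - dequantize(q, scale)))))
```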