Compressing Transformer-based self-supervised models for speech
processing
- URL: http://arxiv.org/abs/2211.09949v2
- Date: Sat, 27 Jan 2024 03:40:26 GMT
- Title: Compressing Transformer-based self-supervised models for speech
processing
- Authors: Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen,
Tzu-hsun Feng, Hung-yi Lee, Hao Tang
- Abstract summary: We study several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation.
We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations.
Our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches.
- Score: 45.254624876127124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of Transformers in self-supervised learning with
applications to various downstream tasks, the computational cost of training
and inference remains a major challenge for applying these models to a wide
spectrum of devices. Several isolated attempts have been made to compress
Transformers, but the settings and metrics differ across studies. Trade-offs
at various compression rates are also largely missing in prior work, making it
difficult to compare compression techniques. In this work, we aim to provide
context for the isolated results, studying several commonly used compression
techniques, including weight pruning, head pruning, low-rank approximation,
and knowledge distillation. We report trade-offs at various compression rates,
including wall-clock time, the number of parameters, and the number of
multiply-accumulate operations. Our results show that compared to recent
approaches, basic compression techniques are strong baselines. We further
present several applications of our results, revealing properties of
Transformers, such as the significance of diagonal attention heads. In
addition, our results lead to a simple combination of compression techniques
that improves the trade-off over recent approaches. We hope these results will
promote more diverse comparisons among model compression techniques and
promote the use of model compression as a tool for analyzing models. Our code
for compressing speech self-supervised models is available at
https://github.com/nervjack2/Speech-SSL-Compression/.
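As a rough illustration of two of the basic techniques named in the abstract, the following PyTorch sketch applies unstructured magnitude-based weight pruning and truncated-SVD low-rank approximation to a single linear layer. It is a minimal sketch, not the authors' implementation (which lives in the linked repository); the layer sizes, sparsity level, and rank are arbitrary placeholder values.

```python
# Minimal sketch (not the authors' code): unstructured weight pruning and
# low-rank approximation applied to a single nn.Linear layer. Sizes, the
# sparsity level, and the rank are arbitrary illustrative choices.
import torch
import torch.nn as nn


def magnitude_prune(linear: nn.Linear, sparsity: float) -> nn.Linear:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    with torch.no_grad():
        w = linear.weight
        k = int(sparsity * w.numel())
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))
    return linear


def low_rank_approx(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear with two smaller ones via truncated SVD of its weight."""
    W = linear.weight.data                      # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # (out_features, rank)
    V_r = Vh[:rank, :]                          # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)         # second(first(x)) ~= linear(x)


if __name__ == "__main__":
    x = torch.randn(4, 768)
    pruned = magnitude_prune(nn.Linear(768, 3072), sparsity=0.5)
    factored = low_rank_approx(nn.Linear(768, 3072), rank=128)
    print(pruned(x).shape, factored(x).shape)   # both: torch.Size([4, 3072])
```

With these placeholder settings, the factored layer stores (768 + 3072) × 128 ≈ 0.49M weights instead of 768 × 3072 ≈ 2.36M, at the cost of some approximation error; measuring such accuracy versus parameter/MAC/wall-clock trade-offs consistently across techniques is what the paper sets out to do.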
Related papers
- Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI [26.45869748408205]
Token compression techniques have emerged as powerful tools for Vision Transformer (ViT) inference in computer vision.
We present the first systematic taxonomy and comparative study of token compression methods.
Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs.
arXiv Detail & Related papers (2025-07-13T16:26:05Z)
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
- Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data [8.475091996107741]
This paper investigates whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible.
We train families of models on 165GB of raw byte sequences of either text, image, or audio data.
We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms.
arXiv Detail & Related papers (2024-10-07T14:32:03Z)
- TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning [11.167198972934736]
Large language models (LLMs) such as GPT-4 have led to a surge in the size of prompts required for optimal performance.
We propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method.
We demonstrate that our RL-guided compression method improves the task performance by 8% - 189% over state-of-the-art compression techniques.
arXiv Detail & Related papers (2024-09-19T18:11:59Z)
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments [20.360936113552597]
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output.
Existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools.
To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare.
Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models' predictions, weights, and activations.
arXiv Detail & Related papers (2024-08-06T16:17:51Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- Lossy and Lossless (L$^2$) Post-training Model Size Compression [12.926354646945397]
We propose a post-training model size compression method that combines lossy and lossless compression in a unified way.
Our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time.
arXiv Detail & Related papers (2023-08-08T14:10:16Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method that has several appealing properties prior arts do not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Extreme Compression for Pre-trained Transformers Made Simple and Efficient [31.719905773863566]
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices.
We propose a simple yet effective compression pipeline for extreme compression, named XTC.
arXiv Detail & Related papers (2022-06-04T00:19:45Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Differentiable Microscopy for Content and Task Aware Compressive Fluorescence Imaging [0.0]
Trade-off between throughput and image quality is an inherent challenge in microscopy.
Deep Learning based methods have achieved greater success in compression and image quality.
We propose differentiable compressive fluorescence microscopy.
arXiv Detail & Related papers (2022-03-28T17:53:10Z)
- Extreme Model Compression for On-device Natural Language Understanding [6.941609786551173]
We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes.
Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive performance.
arXiv Detail & Related papers (2020-11-30T21:47:48Z)
- Analyzing and Mitigating JPEG Compression Defects in Deep Learning [69.04777875711646]
We present a unified study of the effects of JPEG compression on a range of common tasks and datasets.
We show that there is a significant penalty on common performance metrics for high compression.
arXiv Detail & Related papers (2020-11-17T20:32:57Z)
- Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)