Compressing Transformer-based self-supervised models for speech
processing
- URL: http://arxiv.org/abs/2211.09949v2
- Date: Sat, 27 Jan 2024 03:40:26 GMT
- Title: Compressing Transformer-based self-supervised models for speech
processing
- Authors: Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen,
Tzu-hsun Feng, Hung-yi Lee, Hao Tang
- Abstract summary: We study several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation.
We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations.
Our results lead to a simple combination of compression techniques that improves the trade-off over recent approaches.
- Score: 45.254624876127124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of Transformers in self-supervised learning with
applications to various downstream tasks, the computational cost of training
and inference remains a major challenge for applying these models to a wide
spectrum of devices. Several isolated attempts have been made to compress
Transformers, but the settings and metrics differ across studies. Trade-offs
at various compression rates are also largely missing in prior work, making it
difficult to compare compression techniques. In this work, we aim to provide
context for the isolated results, studying several commonly used compression
techniques, including weight pruning, head pruning, low-rank approximation,
and knowledge distillation. We report trade-offs at various compression rates,
including wall-clock time, the number of parameters, and the number of
multiply-accumulate operations. Our results show that compared to recent
approaches, basic compression techniques are strong baselines. We further
present several applications of our results, revealing properties of
Transformers, such as the significance of diagonal attention heads. In
addition, our results lead to a simple combination of compression techniques
that improves the trade-off over recent approaches. We hope these results will
promote more diverse comparisons among model compression techniques and
promote the use of model compression as a tool for analyzing models. Our code
for compressing speech self-supervised models is available at
https://github.com/nervjack2/Speech-SSL-Compression/.
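As a rough illustration of two of the basic techniques named in the abstract, the following PyTorch sketch applies unstructured magnitude-based weight pruning and truncated-SVD low-rank approximation to a single linear layer. It is a minimal sketch, not the authors' implementation (which lives in the linked repository); the layer sizes, sparsity level, and rank are arbitrary placeholder values.

```python
# Minimal sketch (not the authors' code): unstructured weight pruning and
# low-rank approximation applied to a single nn.Linear layer. Sizes, the
# sparsity level, and the rank are arbitrary illustrative choices.
import torch
import torch.nn as nn


def magnitude_prune(linear: nn.Linear, sparsity: float) -> nn.Linear:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    with torch.no_grad():
        w = linear.weight
        k = int(sparsity * w.numel())
        if k > 0:
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))
    return linear


def low_rank_approx(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one Linear with two smaller ones via truncated SVD of its weight."""
    W = linear.weight.data                      # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                # (out_features, rank)
    V_r = Vh[:rank, :]                          # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)         # second(first(x)) ~= linear(x)


if __name__ == "__main__":
    x = torch.randn(4, 768)
    pruned = magnitude_prune(nn.Linear(768, 3072), sparsity=0.5)
    factored = low_rank_approx(nn.Linear(768, 3072), rank=128)
    print(pruned(x).shape, factored(x).shape)   # both: torch.Size([4, 3072])
```

With these placeholder settings, the factored layer stores (768 + 3072) × 128 ≈ 0.49M weights instead of 768 × 3072 ≈ 2.36M, at the cost of some approximation error; measuring such accuracy versus parameter/MAC/wall-clock trade-offs consistently across techniques is what the paper sets out to do.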
Related papers
- Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI [26.45869748408205]
Token compression techniques have emerged as powerful tools for Vision Transformer (ViT) inference in computer vision.
We present the first systematic taxonomy and comparative study of token compression methods.
Our experiments reveal that while token compression methods are effective for general-purpose ViTs, they often underperform when directly applied to compact designs.
arXiv Detail & Related papers (2025-07-13T16:26:05Z)
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
- Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data [8.475091996107741]
This paper investigates whether there is a sweet spot where competitive compression ratios with pre-trained vanilla transformers are possible.
We train families of models on 165GB of raw byte sequences of either text, image, or audio data.
We find that relatively small models (i.e., millions of parameters) can outperform standard general-purpose compression algorithms.
arXiv Detail & Related papers (2024-10-07T14:32:03Z)
- TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning [11.167198972934736]
Large language models (LLMs) such as GPT-4 have led to a surge in the size of prompts required for optimal performance.
We propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method.
We demonstrate that our RL-guided compression method improves the task performance by 8% - 189% over state-of-the-art compression techniques.
arXiv Detail & Related papers (2024-09-19T18:11:59Z)
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
- Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments [20.360936113552597]
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output.
Existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools.
To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare.
Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models' predictions, weights, and activations.
arXiv Detail & Related papers (2024-08-06T16:17:51Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- Lossy and Lossless (L$^2$) Post-training Model Size Compression [12.926354646945397]
We propose a post-training model size compression method that combines lossy and lossless compression in a unified way.
Our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time.
arXiv Detail & Related papers (2023-08-08T14:10:16Z)
- DiffRate: Differentiable Compression Rate for Efficient Vision Transformers [98.33906104846386]
Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens.
DiffRate is a novel token compression method that has several appealing properties prior arts do not have.
arXiv Detail & Related papers (2023-05-29T10:15:19Z)
- Extreme Compression for Pre-trained Transformers Made Simple and Efficient [31.719905773863566]
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constraint devices.
We propose a simple yet effective compression pipeline for extreme compression, named XTC.
arXiv Detail & Related papers (2022-06-04T00:19:45Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Differentiable Microscopy for Content and Task Aware Compressive Fluorescence Imaging [0.0]
Trade-off between throughput and image quality is an inherent challenge in microscopy.
Deep Learning based methods have achieved greater success in compression and image quality.
We propose differentiable compressive fluorescence microscopy.
arXiv Detail & Related papers (2022-03-28T17:53:10Z)
- Extreme Model Compression for On-device Natural Language Understanding [6.941609786551173]
We show our results on a large-scale, commercial NLU system trained on a varied set of intents with huge vocabulary sizes.
Our approach outperforms a range of baselines and achieves a compression rate of 97.4% with less than 3.7% degradation in predictive performance.
arXiv Detail & Related papers (2020-11-30T21:47:48Z)
- Analyzing and Mitigating JPEG Compression Defects in Deep Learning [69.04777875711646]
We present a unified study of the effects of JPEG compression on a range of common tasks and datasets.
We show that there is a significant penalty on common performance metrics for high compression.
arXiv Detail & Related papers (2020-11-17T20:32:57Z)
- Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)