Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates
- URL: http://arxiv.org/abs/2505.22608v1
- Date: Wed, 28 May 2025 17:24:21 GMT
- Title: Effective and Efficient One-pass Compression of Speech Foundation Models Using Sparsity-aware Self-pinching Gates
- Authors: Haoning Xu, Zhaoqing Li, Youjun Chen, Huimeng Wang, Guinan Li, Mengzhe Geng, Chengxi Deng, Xunying Liu
- Abstract summary: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of wav2vec2.0-base and HuBERT-large models by 65% and 60% respectively.
- Score: 20.16951333751427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel approach for speech foundation models compression that tightly integrates model pruning and parameter update into a single stage. Highly compact layer-level tied self-pinching gates each containing only a single learnable threshold are jointly trained with uncompressed models and used in fine-grained neuron level pruning. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of wav2vec2.0-base and HuBERT-large models by 65% and 60% respectively, while incurring no statistically significant word error rate (WER) increase on the test-clean dataset. Compared to previously published methods on the same task, our approach not only achieves the lowest WER of 7.05% on the test-clean dataset under a comparable model compression ratio of 4.26x, but also operates with at least 25% less model compression time.
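The abstract specifies only that each layer's tied gate carries a single learnable threshold trained jointly with the uncompressed model; the exact gating function is not given. The sketch below is therefore an assumption-laden illustration: per-neuron weight-norm saliencies are softly compared against the shared threshold through a steep sigmoid, so the threshold and the model weights can be optimized together in one pass. All names (`SelfPinchingGate`, `tau`, `temperature`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class SelfPinchingGate(nn.Module):
    """Illustrative layer-level gate with one learnable threshold
    (an assumption, not the authors' implementation)."""

    def __init__(self, layer: nn.Linear, temperature: float = 50.0):
        super().__init__()
        self.layer = layer
        self.temperature = temperature
        # Single threshold tied across all output neurons of this layer.
        self.tau = nn.Parameter(torch.zeros(1))

    def neuron_mask(self) -> torch.Tensor:
        # Assumed saliency: L2 norm of each output neuron's weight row.
        saliency = self.layer.weight.norm(dim=1)
        # Steep sigmoid acts as a differentiable "pinch" around the threshold.
        return torch.sigmoid(self.temperature * (saliency - self.tau))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft neuron-level pruning applied while the base weights keep training.
        return self.layer(x) * self.neuron_mask()

# A sparsity penalty on the mask (not shown) would be needed to push tau toward
# the target compression ratio; neurons whose gate value ends up near zero could
# then be removed in a single pass.
gate = SelfPinchingGate(nn.Linear(768, 3072))
y = gate(torch.randn(4, 768))
```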
Related papers
- Forget the Data and Fine-Tuning! Just Fold the Network to Compress [13.611551223875194]
We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers. We show that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.
arXiv Detail & Related papers (2025-02-14T15:10:43Z)
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP), an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
- You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning [20.62274005080048]
PruneNet is a novel model compression method that reformulates model pruning as a policy learning process. It can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance. On complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model.
arXiv Detail & Related papers (2025-01-25T18:26:39Z)
- FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing [59.12511498024836]
We present a method to prune large language models (LLMs) that selectively prunes model blocks based on an importance score. We propose a principled metric to replace each pruned block using a weight-sharing mechanism. Empirical evaluations demonstrate substantial performance gains over existing methods.
arXiv Detail & Related papers (2025-01-24T18:46:37Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference; a generic TopK sketch is given after this list.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
- (Certified!!) Adversarial Robustness for Free! [116.6052628829344]
We certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within a 2-norm of 0.5.
We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine tuning or retraining of model parameters.
arXiv Detail & Related papers (2022-06-21T17:27:27Z)
- Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models [7.6356407698088]
Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models.
We show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization.
arXiv Detail & Related papers (2022-05-25T11:54:37Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator (a generic STE sketch is given after this list).
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
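For the "Activations and Gradients Compression for Model-Parallel Training" entry above, the TopK operator it refers to can be illustrated generically: keep the fraction of entries with the largest magnitude and zero the rest. The helper below is a hypothetical stand-in (the name `topk_compress` and the `ratio` parameter are ours); a real model-parallel setup would communicate only the surviving values and their indices rather than a dense tensor.

```python
import torch

def topk_compress(t: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Keep the `ratio` fraction of largest-magnitude entries; zero out the rest."""
    flat = t.flatten()
    k = max(1, int(ratio * flat.numel()))
    idx = flat.abs().topk(k).indices          # indices of the surviving entries
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(t)

# Example: compress a batch of activations to 5% density before sending it on.
activations = torch.randn(32, 1024)
compressed = topk_compress(activations, ratio=0.05)
```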
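For the "Training with Quantization Noise for Extreme Model Compression" entry, the Straight-Through Estimator it builds on is a standard trick: apply a non-differentiable quantize-dequantize step in the forward pass and let gradients pass through it unchanged in the backward pass. The sketch below shows generic int8-style fake quantization with an STE backward; it is not the extreme-compression scheme proposed in that paper.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        # Quantize-dequantize: the rounding is what makes this non-differentiable.
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pretend the rounding was the identity and pass gradients through.
        return grad_output, None

w = torch.randn(256, 256, requires_grad=True)
loss = QuantizeSTE.apply(w).pow(2).sum()
loss.backward()   # w.grad is defined despite the rounding in the forward pass
```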
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.