Sparsification via Compressed Sensing for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2102.04932v1
- Date: Tue, 9 Feb 2021 16:41:31 GMT
- Title: Sparsification via Compressed Sensing for Automatic Speech Recognition
- Authors: Kai Zhen (1 and 2), Hieu Duy Nguyen (2), Feng-Ju Chang (2), Athanasios
Mouchtaris (2), and Ariya Rastrow (2). ((1) Indiana University Bloomington,
(2) Alexa Machine Learning, Amazon, USA)
- Abstract summary: Large-scale machine learning applications require model quantization and compression.
We propose a compressed sensing based pruning (CSP) approach to determine when and which weights should be pruned.
We show that CSP consistently outperforms existing approaches in the literature.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to achieve high accuracy for machine learning (ML) applications, it
is essential to employ models with a large number of parameters. Certain
applications, such as Automatic Speech Recognition (ASR), however, require
real-time interactions with users, hence compelling the model to have as low
latency as possible. Deploying large-scale ML applications thus necessitates
model quantization and compression, especially when running ML models on
resource-constrained devices. For example, by forcing some of the model weight
values to zero, it is possible to apply zero-weight compression, which
reduces both the model size and model reading time from the memory. In the
literature, such methods are referred to as sparse pruning. The fundamental
questions are when and which weights should be forced to zero, i.e., be pruned.
In this work, we propose a compressed sensing based pruning (CSP) approach to
effectively address those questions. By reformulating sparse pruning as a
sparsity inducing and compression-error reduction dual problem, we introduce
the classic compressed sensing process into the ML model training process.
Using the ASR task as an example, we show that CSP consistently outperforms
existing approaches in the literature.
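The abstract does not spell out the CSP algorithm itself, but the step it builds on, forcing all but the largest-magnitude weights of a layer to zero, is easy to illustrate. Below is a minimal sketch of generic magnitude-based hard thresholding in NumPy; the layer shape, input batch, and sparsity level are arbitrary assumptions, and this is not the authors' compressed-sensing formulation.
```python
import numpy as np

def hard_threshold(weights: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Force all but the largest-magnitude entries of `weights` to zero.

    Generic magnitude pruning, used here only to illustrate the kind of
    sparsity-inducing step the abstract describes (not CSP itself).
    """
    k = max(1, int(keep_ratio * weights.size))
    threshold = np.partition(np.abs(weights).ravel(), -k)[-k]
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

# Toy example: prune a random 256x256 "layer" to ~20% density and measure
# how much its outputs change on a random input batch.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
x = rng.normal(size=(256, 32))
W_sparse = hard_threshold(W, keep_ratio=0.2)
rel_err = np.linalg.norm(W @ x - W_sparse @ x) / np.linalg.norm(W @ x)
print(f"density: {np.mean(W_sparse != 0):.2f}, relative output error: {rel_err:.3f}")
```
The printed relative output error corresponds to the compression error that, per the abstract, CSP trades off against the induced sparsity.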
Related papers
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP).
ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.
We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint becomes a bottleneck at long sequence lengths.
Existing approaches to mitigate this issue include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
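The summary names the core building block, a low-rank approximation of the KV projection weights, without giving the progressive strategy. The sketch below shows only that building block via truncated SVD on a stand-in matrix; the dimensions and rank are arbitrary assumptions, and real projection matrices are far closer to low rank than this random example.
```python
import numpy as np

def low_rank_factors(W: np.ndarray, rank: int):
    """Return thin factors A (d_out x r) and B (r x d_in) with W ~= A @ B.

    Storing A and B instead of W saves memory whenever
    r * (d_out + d_in) < d_out * d_in. Generic sketch, not the LoRC pipeline.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W_k = rng.normal(size=(1024, 1024))       # stand-in for a key-projection matrix
A, B = low_rank_factors(W_k, rank=64)
kept = (A.size + B.size) / W_k.size
rel_err = np.linalg.norm(W_k - A @ B) / np.linalg.norm(W_k)
print(f"parameters kept: {kept:.1%}, relative approximation error: {rel_err:.3f}")
```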
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
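The TopK compressor referred to above keeps only the largest-magnitude entries of a tensor and communicates their values and indices. A minimal sketch, with error feedback and any distributed machinery omitted, could look like this:
```python
import numpy as np

def topk_compress(tensor: np.ndarray, ratio: float):
    """Keep the top `ratio` fraction of entries by magnitude.

    Returns the kept values and their flat indices -- what would actually be
    sent between workers. Generic sketch without error feedback.
    """
    k = max(1, int(ratio * tensor.size))
    flat = tensor.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return flat[idx], idx

def topk_decompress(values: np.ndarray, idx: np.ndarray, shape) -> np.ndarray:
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[idx] = values
    return out.reshape(shape)

rng = np.random.default_rng(0)
grad = rng.normal(size=(512, 512))
vals, idx = topk_compress(grad, ratio=0.05)
grad_hat = topk_decompress(vals, idx, grad.shape)
print(f"entries kept: {vals.size / grad.size:.1%}, "
      f"gradient norm retained: {np.linalg.norm(grad_hat) / np.linalg.norm(grad):.2f}")
```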
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
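As a concrete illustration of "representing model parameters with fewer bits", the snippet below performs plain symmetric uniform quantization to 8-bit integers; it is a generic example, not the specific quantization scheme evaluated in the paper.
```python
import numpy as np

def quantize_symmetric(w: np.ndarray, num_bits: int = 8):
    """Represent float weights as `num_bits`-bit signed integers plus a scale.

    Plain symmetric uniform quantization -- a generic example of storing
    parameters with fewer bits, not the scheme studied in the paper.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)  # int8 holds <= 8 bits
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=10_000)
q, scale = quantize_symmetric(w)
w_hat = q.astype(np.float32) * scale
print(f"max abs reconstruction error: {np.max(np.abs(w - w_hat)):.6f}")
```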
arXiv Detail & Related papers (2023-12-01T22:27:12Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
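A rough sketch of the Dense-and-Sparse idea in point (ii): pull the largest-magnitude outliers into a separately stored sparse matrix kept at full precision and quantize the dense remainder with a small number of levels. The uniform 3-bit grid used here is a simplification; SqueezeLLM's actual quantizer is non-uniform and sensitivity-based.
```python
import numpy as np

def dense_and_sparse(W: np.ndarray, outlier_frac: float = 0.005, num_bits: int = 3):
    """Split W into exactly-kept outliers plus a coarsely quantized dense part.

    The outliers (largest magnitudes) would be stored in a sparse format at
    full precision; the rest is snapped to a small uniform grid. Simplified
    sketch -- SqueezeLLM's real quantizer is non-uniform and sensitivity-based.
    """
    k = max(1, int(outlier_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    outlier_mask = np.abs(W) >= thresh

    dense = np.where(outlier_mask, 0.0, W)
    levels = 2 ** num_bits
    lo, hi = dense.min(), dense.max()
    step = (hi - lo) / (levels - 1)
    dense_q = np.round((dense - lo) / step) * step + lo

    return np.where(outlier_mask, W, dense_q)   # outliers exact, rest quantized

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(512, 512))
rows, cols = rng.integers(0, 512, 50), rng.integers(0, 512, 50)
W[rows, cols] += 1.0                            # inject a few large outliers
W_hat = dense_and_sparse(W)
print(f"relative error: {np.linalg.norm(W - W_hat) / np.linalg.norm(W):.4f}")
```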
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Deep learning model compression using network sensitivity and gradients [3.52359746858894]
We present model compression algorithms for both non-retraining and retraining conditions.
In the first case, we propose the Bin & Quant algorithm for compression of the deep learning models using the sensitivity of the network parameters.
In the second case, we propose our novel gradient-weighted k-means clustering algorithm (GWK).
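The summary does not define GWK precisely; as one plausible reading, the sketch below clusters a layer's scalar weights into a small shared codebook, weighting each sample by an importance score taken (hypothetically) from gradient magnitudes. This is an assumption-laden illustration of weight clustering, not the authors' exact algorithm.
```python
import numpy as np

def weighted_kmeans_1d(values, importance, k=16, iters=20, seed=0):
    """Cluster scalar weights into k shared values, weighting each sample by
    an importance score (here, hypothetically, a gradient magnitude).
    One plausible reading of gradient-weighted clustering, not the exact GWK.
    """
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = assign == j
            if members.any():
                centers[j] = np.average(values[members],
                                        weights=importance[members] + 1e-12)
    return centers, assign

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=20_000)     # a layer's weights
g = np.abs(rng.normal(size=20_000))         # stand-in gradient magnitudes
centers, assign = weighted_kmeans_1d(w, g)
w_hat = centers[assign]                     # every weight replaced by a shared value
print(f"codebook size: {centers.size}, quantization MSE: {np.mean((w - w_hat) ** 2):.2e}")
```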
arXiv Detail & Related papers (2022-10-11T03:02:40Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than standard SGD/Adam-based baselines while remaining stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
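The exact CrAM update is not given in the summary; the toy sketch below only conveys the general compression-aware idea, evaluating the gradient at a pruned copy of the parameters and applying it to the dense parameters so that the trained model stays accurate after pruning. The quadratic objective, pruning rule, and step size are all arbitrary assumptions.
```python
import numpy as np

def prune_smallest(theta: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Magnitude pruning: zero out all but the largest-magnitude entries."""
    k = max(1, int(keep_ratio * theta.size))
    thresh = np.partition(np.abs(theta), -k)[-k]
    return np.where(np.abs(theta) >= thresh, theta, 0.0)

def loss_and_grad(theta, A, b):
    """Toy least-squares objective 0.5 * ||A @ theta - b||^2 and its gradient."""
    r = A @ theta - b
    return 0.5 * r @ r, A.T @ r

# Generic compression-aware training loop (not the exact CrAM rule): the
# gradient is evaluated at a pruned copy of the parameters but applied to the
# dense parameters, pushing them toward a point that prunes well.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 100))
b = rng.normal(size=200)
theta = rng.normal(size=100)
for _ in range(500):
    _, g = loss_and_grad(prune_smallest(theta), A, b)
    theta -= 1e-3 * g

dense_loss, _ = loss_and_grad(theta, A, b)
pruned_loss, _ = loss_and_grad(prune_smallest(theta), A, b)
print(f"dense loss: {dense_loss:.3f}, loss after 50% pruning: {pruned_loss:.3f}")
```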
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning in a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z) - Automated Model Compression by Jointly Applied Pruning and Quantization [14.824593320721407]
In the traditional deep compression framework, iteratively performing network pruning and quantization can reduce the model size and computation cost.
We tackle this issue by integrating network pruning and quantization as a unified joint compression problem and then use AutoML to automatically solve it.
We propose the automated model compression by jointly applied pruning and quantization (AJPQ).
arXiv Detail & Related papers (2020-11-12T07:06:29Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
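The summary describes Quantization Aware Training with the Straight-Through Estimator and the paper's extension to quantizing only part of the weights. A minimal PyTorch sketch of that mechanism, fake-quantizing a random subset of weights each forward pass while gradients flow to the full-precision copy, might look like the following; the bit width, noise rate, and layer shapes are arbitrary, and this is not the paper's implementation.
```python
import torch

def fake_quantize_ste(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Forward pass uses quantized weights; backward pass lets gradients flow
    through unchanged (Straight-Through Estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()          # value of w_q, gradient of identity

def quant_noise(w: torch.Tensor, p: float = 0.5, num_bits: int = 8) -> torch.Tensor:
    """Fake-quantize only a random fraction p of the weights each call,
    leaving the rest in full precision (hedged sketch of quantization noise,
    not the paper's implementation)."""
    mask = (torch.rand_like(w) < p).to(w.dtype)
    return mask * fake_quantize_ste(w, num_bits) + (1.0 - mask) * w

# Minimal usage: a linear layer whose weights see quantization noise while
# gradients still reach the full-precision parameters.
torch.manual_seed(0)
weight = torch.randn(64, 32, requires_grad=True)
x = torch.randn(16, 32)
out = x @ quant_noise(weight).t()
out.sum().backward()
print(weight.grad.shape)                   # torch.Size([64, 32])
```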
arXiv Detail & Related papers (2020-04-15T20:10:53Z)