INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on
Mobile Devices
- URL: http://arxiv.org/abs/2010.14841v1
- Date: Wed, 28 Oct 2020 09:25:49 GMT
- Title: INT8 Winograd Acceleration for Conv1D Equipped ASR Models Deployed on
Mobile Devices
- Authors: Yiwu Yao, Yuchao Li, Chengyu Wang, Tianhang Yu, Houjiang Chen,
Xiaotang Jiang, Jun Yang, Jun Huang, Wei Lin, Hui Shu, Chengfei Lv
- Abstract summary: The intensive computation of Automatic Speech Recognition (ASR) models prevents them from being deployed on mobile devices.
We present a novel quantized Winograd optimization pipeline, which combines the quantization and fast convolution to achieve efficient inference acceleration on mobile devices for ASR models.
- Score: 16.13681155725083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The intensive computation of Automatic Speech Recognition (ASR) models
prevents them from being deployed on mobile devices. In this paper, we present
a novel quantized Winograd optimization pipeline, which combines the
quantization and fast convolution to achieve efficient inference acceleration
on mobile devices for ASR models. To avoid the information loss due to the
combination of quantization and Winograd convolution, a Range-Scaled
Quantization (RSQ) training method is proposed to expand the quantized
numerical range and to distill knowledge from high-precision values. Moreover,
an improved Conv1D-equipped DFSMN (ConvDFSMN) model is designed for mobile
deployment. We conduct extensive experiments on both ConvDFSMN and Wav2letter
models. Results demonstrate that the models can be effectively optimized with
the proposed pipeline. In particular, Wav2letter achieves a 1.48× speedup with
an approximate 0.07% WER decrease on ARMv7-based mobile devices.
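
To make the abstract's core idea concrete, here is a minimal NumPy sketch of int8 Winograd convolution for Conv1D using the standard F(2,3) one-dimensional transform: the weights are transformed into the Winograd domain offline and quantized there, the quantized input tiles are transformed with integer-only matrices, and the element-wise int32 products are transformed back and dequantized. The tile size, the plain symmetric per-tensor quantizer, and all function names are illustrative assumptions; this is not the paper's RSQ training method or its ARM implementation.

```python
import numpy as np

# Winograd F(2, 3) transforms for 1D convolution:
# 2 outputs per tile, kernel length 3, input tile length 4.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.int32)   # input transform (integer-only)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                   # weight transform (float, offline)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.int32)    # output transform (integer-only)

def quantize_int8(x, scale):
    """Plain symmetric per-tensor int8 quantization (illustrative, not RSQ)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def int8_winograd_conv1d(x, w, s_x, s_u):
    """'Valid' 1D correlation via int8 Winograd F(2,3).

    x: float input signal, w: float length-3 kernel,
    s_x / s_u: scales for the input and the Winograd-domain weights.
    Assumes the number of outputs (len(x) - 2) is even.
    """
    u_q = quantize_int8(G @ w, s_u).astype(np.int32)   # offline weight transform + quantize
    x_q = quantize_int8(x, s_x).astype(np.int32)
    n_out = len(x) - 2
    y = np.empty(n_out)
    for t in range(0, n_out, 2):                       # two outputs per tile
        d = x_q[t:t + 4]                               # int8 input tile (held as int32)
        v = BT @ d                                     # integer input transform
        m = u_q * v                                    # element-wise product, int32
        y[t:t + 2] = (AT @ m) * (s_x * s_u)            # output transform + dequantize
    return y

# Check against a direct correlation (np.convolve flips the kernel, so reverse it).
x = np.random.randn(34).astype(np.float32)
w = np.random.randn(3).astype(np.float32)
ref = np.convolve(x, w[::-1], mode="valid")
out = int8_winograd_conv1d(x, w,
                           s_x=np.abs(x).max() / 127,
                           s_u=np.abs(G @ w).max() / 127)
print(np.abs(out - ref).max())   # small residual, set by int8 rounding
```

For F(2,3) the entries of B^T and A^T are only 0 and ±1, so the input and output transforms stay in integer arithmetic; that property is what makes the int8 + Winograd combination attractive on mobile CPUs.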
Related papers
- A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR [0.31077024712075796]
Punctuation and word casing prediction are necessary for automatic speech recognition (ASR).
We propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time.
arXiv Detail & Related papers (2024-07-18T04:01:12Z)
- Task-Agnostic Structured Pruning of Speech Representation Models [18.555223754089905]
We propose a fine-grained attention head pruning method to compensate for the performance degradation.
Experiments on the SUPERB benchmark show that our model can achieve comparable performance to the dense model in multiple tasks.
arXiv Detail & Related papers (2023-06-02T09:11:06Z)
- Operator Splitting Value Iteration [27.505231431328255]
We introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems.
OS-VI achieves a much faster convergence rate when the model is accurate enough.
Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in the presence of model approximation error.
arXiv Detail & Related papers (2022-11-25T07:34:26Z)
- ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization [31.494669469303954]
We propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads.
Our design results in 2.8× speedup and 2.5× energy efficiency improvement over the state-of-the-art quantization accelerators.
arXiv Detail & Related papers (2022-08-30T14:12:49Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose compression for on-demand compute reduction for wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer [121.85581713299918]
We propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized Vision Transformers (ViTs).
Given the model structure and the desired frame rate, VAQF will automatically output the required quantization precision for activations.
This is the first time quantization has been incorporated into ViT acceleration on FPGAs.
arXiv Detail & Related papers (2022-01-17T20:27:52Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but this also leads to huge computation costs.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Edge Federated Learning Via Unit-Modulus Over-The-Air Computation (Extended Version) [64.76619508293966]
This paper proposes a unit-modulus over-the-air computation (UM-AirComp) framework to facilitate efficient edge federated learning.
It simultaneously uploads local model parameters and updates global model parameters via analog beamforming.
We demonstrate the implementation of UM-AirComp in a vehicle-to-everything autonomous driving simulation platform.
arXiv Detail & Related papers (2021-01-28T15:10:22Z)
- Fast-Convergent Federated Learning [82.32029953209542]
Federated learning is a promising solution for distributing machine learning tasks through modern networks of mobile devices.
We propose a fast-convergent federated learning algorithm, called FOLB, which performs intelligent sampling of devices in each round of model training.
arXiv Detail & Related papers (2020-07-26T14:37:51Z)
- Searching for Winograd-aware Quantized Networks [12.351250944079949]
We propose a Winograd-aware formulation of convolution layers which exposes the numerical inaccuracies introduced by the Winograd transformations.
We also address the source of the numerical error and propose a relaxation on the form of the transformation matrices, resulting in up to 10% higher classification accuracy on CIFAR-10.
arXiv Detail & Related papers (2020-02-25T07:53:53Z)
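
This last entry and the RSQ method of the main paper target the same underlying effect: the Winograd input transform expands the numerical range of the data, so an int8 grid sized for the raw inputs either clips or coarsely rounds the transformed tiles. Below is a small NumPy illustration of that range expansion, using one common choice of the F(4,3) input-transform matrix (an assumed example; the exact tile sizes and matrices used in these papers may differ).

```python
import numpy as np

# One common input-transform matrix B^T for Winograd F(4, 3); its entries
# reach +/-5, unlike the 0/+/-1 entries of the F(2, 3) transform.
BT_F43 = np.array([[4,  0, -5,  0, 1, 0],
                   [0, -4, -4,  1, 1, 0],
                   [0,  4, -4, -1, 1, 0],
                   [0, -2, -1,  2, 1, 0],
                   [0,  2, -1, -2, 1, 0],
                   [0,  4,  0, -5, 0, 1]], dtype=np.float64)

rng = np.random.default_rng(0)
tiles = rng.standard_normal((10_000, 6))    # synthetic length-6 input tiles
winograd_domain = tiles @ BT_F43.T          # B^T d for every tile

print("max |raw tile|            :", np.abs(tiles).max())
print("max |Winograd-domain tile|:", np.abs(winograd_domain).max())
# The transformed values span a range several times wider than the raw inputs,
# which is why quantization must be range-aware (RSQ) or Winograd-aware.
```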