Beyond Universal Transformer: block reusing with adaptor in Transformer
for automatic speech recognition
- URL: http://arxiv.org/abs/2303.13072v2
- Date: Wed, 5 Apr 2023 08:36:34 GMT
- Title: Beyond Universal Transformer: block reusing with adaptor in Transformer
for automatic speech recognition
- Authors: Haoyu Tang, Zhaoyi Liu, Chang Zeng, Xinfeng Li
- Abstract summary: We propose a solution that reuses blocks in Transformer models for ASR on edge devices.
Specifically, we design a novel block-reusing strategy for speech Transformer (BRST) to enhance the effectiveness of parameters.
- Score: 2.5680214354539803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have recently made significant achievements in the
application of end-to-end (E2E) automatic speech recognition (ASR). It is
possible to deploy the E2E ASR system on smart devices with the help of
Transformer-based models. However, these models still have the disadvantage of
requiring a large number of parameters. To overcome this drawback of universal
Transformer models for ASR on edge devices, we propose a solution that reuses
blocks in Transformer models for small-footprint ASR systems, meeting the
objective of accommodating resource limitations without compromising
recognition accuracy.
Specifically, we design a novel block-reusing strategy for speech Transformer
(BRST) to enhance the effectiveness of parameters and propose an adapter module
(ADM) that can produce a compact and adaptable model with only a few additional
trainable parameters accompanying each reusing block. We conducted an
experiment with the proposed method on the public AISHELL-1 corpus, and the
results show that the proposed approach achieves character error rates (CER) of
9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM,
respectively. In addition, we provide a deeper analysis of the effect of the ADM
in the general block-reusing method.
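As a rough illustration of the block-reusing idea with per-reuse adapter modules described in the abstract, the sketch below applies one shared Transformer encoder block several times and attaches a small bottleneck adapter to each pass. This is a minimal sketch assuming a standard down/up-projection adapter design; the module names, dimensions, and layer choices are illustrative and not the authors' exact BRST/ADM implementation.

```python
# Minimal sketch (PyTorch): one shared Transformer block reused several times,
# with a small adapter module attached to each reuse step. All sizes and the
# bottleneck adapter design are assumptions, not the paper's exact BRST/ADM.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class BlockReusingEncoder(nn.Module):
    """Applies a single shared encoder block `num_reuses` times.

    Only the per-reuse adapters add trainable parameters, so the total
    footprint stays close to that of one block.
    """

    def __init__(self, d_model: int = 256, nhead: int = 4,
                 num_reuses: int = 6, bottleneck: int = 32):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model,
            batch_first=True)
        self.adapters = nn.ModuleList(
            Adapter(d_model, bottleneck) for _ in range(num_reuses))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for adapter in self.adapters:
            x = self.shared_block(x)  # reused (shared) weights
            x = adapter(x)            # few additional trainable parameters
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)  # (batch, frames, feature dim)
    model = BlockReusingEncoder()
    print(model(feats).shape)  # torch.Size([2, 100, 256])
```

Dropping the adapters and looping over the shared block alone would correspond to the plain block-reusing baseline that the abstract reports without the ADM.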
Related papers
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- External Prompt Features Enhanced Parameter-efficient Fine-tuning for Salient Object Detection [6.5971464769307495]
Salient object detection (SOD) aims at finding the most salient objects in images and outputs pixel-level binary masks.
Transformer-based methods achieve promising performance due to their global semantic understanding.
We propose a novel parameter-efficient fine-tuning method aimed at reducing the number of training parameters.
arXiv Detail & Related papers (2024-04-23T13:15:07Z)
- SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation [53.675725490807615]
We introduce SDPose, a new self-distillation method for improving the performance of small transformer-based models.
SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs, while SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset.
arXiv Detail & Related papers (2024-04-04T15:23:14Z)
- Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition [17.73449206184214]
This paper proposes a parameter-efficient conformer via sharing sparsely-gated experts.
Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing computation.
arXiv Detail & Related papers (2022-09-17T13:22:19Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost to store a large copy of the model weights for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering [75.86788916930377]
We propose a bilaterally slimmable Transformer (BST) that can be integrated into arbitrary Transformer-based VQA models.
One slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2.
The smallest MCAN-BST submodel has 9M parameters and 0.16G FLOPs during inference.
arXiv Detail & Related papers (2022-03-24T02:26:04Z)
- EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation [104.44478403427881]
EdgeFormer is a parameter-efficient Transformer of the encoder-decoder architecture for on-device seq2seq generation.
We conduct experiments on two practical on-device seq2seq tasks: Machine Translation and Grammatical Error Correction.
arXiv Detail & Related papers (2022-02-16T10:10:00Z)
- Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)
- Adaptable Multi-Domain Language Model for Transformer ASR [16.8397357399749]
The proposed model can reuse a fully fine-tuned LM that is fine-tuned using all layers of the original model.
The proposed model is also effective in reducing the model maintenance cost because it is possible to omit the costly and time-consuming common LM pre-training process.
arXiv Detail & Related papers (2020-08-14T06:33:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.