Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for
End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2209.08326v1
- Date: Sat, 17 Sep 2022 13:22:19 GMT
- Title: Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for
End-to-End Speech Recognition
- Authors: Ye Bai, Jie Li, Wenjing Han, Hao Ni, Kaituo Xu, Zhuo Zhang, Cheng Yi,
Xiaorui Wang
- Abstract summary: This paper proposes a parameter-efficient conformer via sharing sparsely-gated experts.
Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a conformer block without increasing computation.
- Score: 17.73449206184214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While transformers and their conformer variants show promising
performance in speech recognition, their large number of parameters leads to
high memory cost during training and inference. Some works use cross-layer
weight sharing to reduce the model's parameters; however, the inevitable loss
of capacity harms model performance. To address this issue, this paper
proposes a parameter-efficient conformer via sharing sparsely-gated experts.
Specifically, we use a sparsely-gated mixture-of-experts (MoE) to extend the
capacity of a conformer block without increasing computation. Then, the
parameters of grouped conformer blocks are shared so that the number of
parameters is reduced. Next, to give the shared blocks the flexibility to
adapt representations at different levels, we design the MoE routers and
normalization layers individually for each block. Moreover, we use knowledge
distillation to further improve performance. Experimental results show that
the proposed model achieves competitive performance with 1/3 of the encoder
parameters, compared with the full-parameter model.
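The sketch below (PyTorch is an assumption; the paper's code is not reproduced here) illustrates the core idea from the abstract: conformer-style blocks within a group share their attention and MoE expert weights, while each block keeps its own sparse router and its own normalization layers so the shared weights can still adapt representations at different depths. All module names, sizes, and the simplified block structure (attention plus MoE feed-forward only) are illustrative choices, not the authors' exact architecture.

```python
# Minimal sketch of a parameter-shared conformer group with sparsely-gated
# experts, following the abstract's description. Hypothetical structure.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Sparsely-gated mixture-of-experts feed-forward layer with top-1 routing."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, router: nn.Linear) -> torch.Tensor:
        # The router is passed in from outside: the experts are shared across
        # blocks, but each block owns its own router (and normalization).
        gates = F.softmax(router(x), dim=-1)       # (batch, time, num_experts)
        top_gate, top_idx = gates.max(dim=-1)      # top-1 expert per frame
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                    # frames routed to expert e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


class SharedMoEConformerGroup(nn.Module):
    """A group of simplified conformer-style blocks that share one attention
    module and one set of MoE experts, while each block position keeps its own
    router and LayerNorms so it can adapt representations at its own depth."""

    def __init__(self, d_model: int = 256, d_ff: int = 1024, num_experts: int = 4,
                 num_heads: int = 4, blocks_per_group: int = 3):
        super().__init__()
        # Shared components (counted once, reused by every block in the group).
        self.shared_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.shared_moe = SparseMoEFFN(d_model, d_ff, num_experts)
        # Per-block components (cheap, not shared): routers and normalization.
        self.routers = nn.ModuleList(
            nn.Linear(d_model, num_experts) for _ in range(blocks_per_group))
        self.attn_norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(blocks_per_group))
        self.moe_norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(blocks_per_group))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for router, attn_norm, moe_norm in zip(self.routers, self.attn_norms,
                                               self.moe_norms):
            h = attn_norm(x)
            attn_out, _ = self.shared_attn(h, h, h)       # same weights each block
            x = x + attn_out
            x = x + self.shared_moe(moe_norm(x), router)  # block-specific router
        return x


if __name__ == "__main__":
    frames = torch.randn(2, 50, 256)                 # (batch, time, features)
    print(SharedMoEConformerGroup()(frames).shape)   # torch.Size([2, 50, 256])
```

In this sketch the parameter savings come from counting the attention and expert weights once per group rather than once per block, while the per-block routers and LayerNorms add only a negligible number of parameters.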
Related papers
- MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards [35.163843138935455]
The rapid scaling of large language models requires more lightweight finetuning methods to reduce the explosive GPU memory overhead.
Our research highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing.
We propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies.
arXiv Detail & Related papers (2024-10-01T07:47:03Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model [81.55141188169621]
We equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios.
We propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer.
Our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
arXiv Detail & Related papers (2023-11-28T11:23:34Z)
- Understanding Parameter Sharing in Transformers [53.75988363281843]
Previous work on Transformers has focused on sharing parameters in different layers, which can improve the performance of models with limited parameters by increasing model depth.
We show that the success of this approach can be largely attributed to better convergence, with only a small part due to the increased model complexity.
Experiments on 8 machine translation tasks show that our model achieves competitive performance with only half the model complexity of parameter sharing models.
arXiv Detail & Related papers (2023-06-15T10:48:59Z)
- Parameter-Efficient Fine-Tuning without Introducing New Latency [7.631596468553607]
We introduce a novel adapter technique that directly applies the adapter to pre-trained parameters instead of the hidden representation.
Our proposed method attains a new state-of-the-art outcome in terms of both performance and storage efficiency, storing only 0.03% parameters of full fine-tuning.
arXiv Detail & Related papers (2023-05-26T08:44:42Z)
- Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity [37.04254056062765]
Stratified Mixture of Experts (SMoE) models can assign dynamic capacity to different tokens.
We show SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
arXiv Detail & Related papers (2023-05-03T15:18:18Z)
- Consolidator: Mergeable Adapter with Grouped Connections for Visual Adaptation [53.835365470800916]
We show how to efficiently and effectively transfer knowledge in a vision transformer.
We propose consolidator to modify the pre-trained model with the addition of a small set of tunable parameters.
Our consolidator can reach up to 7.56 better accuracy than full fine-tuning with merely 0.35% parameters.
arXiv Detail & Related papers (2023-04-30T23:59:02Z)
- Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers [16.88840622945725]
We develop the Subformer, a parameter efficient Transformer-based model.
Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters.
arXiv Detail & Related papers (2021-01-01T13:53:22Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.