Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech
Model
- URL: http://arxiv.org/abs/2310.02971v3
- Date: Tue, 14 Nov 2023 21:15:01 GMT
- Title: Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech
Model
- Authors: Kai-Wei Chang, Ming-Hsin Chen, Yun-Ping Lin, Jing Neng Hsu, Paul
Kuo-Ming Huang, Chien-yu Huang, Shang-Wen Li, Hung-yi Lee
- Abstract summary: We show that prompting on Wav2Seq, a self-supervised encoder-decoder model, surpasses previous works in sequence generation tasks.
It achieves a remarkable 53% relative improvement in word error rate for ASR and a 27% relative improvement in F1 score for slot filling.
We also show the transferability of prompting and adapter tuning on Wav2Seq in cross-lingual ASR.
- Score: 84.12646619522774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompting and adapter tuning have emerged as efficient alternatives to
fine-tuning (FT) methods. However, existing studies on speech prompting focused
on classification tasks and failed on more complex sequence generation tasks.
Besides, adapter tuning has primarily been applied to encoder-only
self-supervised models. Our experiments show that prompting on Wav2Seq, a
self-supervised encoder-decoder model, surpasses previous works in sequence
generation tasks. It achieves a remarkable 53% relative improvement in word
error rate for ASR and a 27% relative improvement in F1 score for slot
filling. Additionally,
prompting competes with the FT method in the low-resource scenario. Moreover,
we show the transferability of prompting and adapter tuning on Wav2Seq in
cross-lingual ASR. When limited trainable parameters are involved, prompting
and adapter tuning consistently outperform conventional FT across 7 languages.
Notably, in the low-resource scenario, prompting consistently outperforms
adapter tuning.
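As a concrete illustration of the setup described in the abstract, the following PyTorch sketch prepends trainable prompt vectors to the encoder input of a frozen encoder-decoder backbone, so that only the prompt embeddings are updated. The backbone interface, argument names, and hyperparameters are assumptions made for this sketch; they are not taken from the authors' Wav2Seq code.

```python
import torch
import torch.nn as nn


class PromptTunedSeq2Seq(nn.Module):
    """Prompt tuning for a frozen encoder-decoder speech model (sketch).

    `backbone` is a hypothetical stand-in for a pre-trained Wav2Seq-style
    model; its call signature is assumed, not taken from released code.
    """

    def __init__(self, backbone: nn.Module, d_model: int, n_prompts: int = 20):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter; only the prompt vectors are trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Trainable prompt embeddings prepended along the time axis.
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, speech_feats: torch.Tensor, decoder_input_ids: torch.Tensor):
        # speech_feats: (batch, time, d_model) encoder input features.
        batch = speech_feats.size(0)
        prompt = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the prompts, then run the frozen encoder-decoder backbone.
        enc_in = torch.cat([prompt, speech_feats], dim=1)
        return self.backbone(encoder_inputs=enc_in,
                             decoder_input_ids=decoder_input_ids)
```

During training only the prompt parameters are handed to the optimizer, e.g. `torch.optim.AdamW([model.prompts], lr=1e-3)`, which is what keeps the trainable-parameter budget far below full fine-tuning.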
Related papers
- iConFormer: Dynamic Parameter-Efficient Tuning with Input-Conditioned Adaptation [15.97351561456467]
In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer.
We introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation.
To be specific, iCoN generates channel-wise convolutional kernels for each feature and transforms it through an adaptive convolution process, effectively capturing task-specific and fine-grained details tailored to downstream tasks.
arXiv Detail & Related papers (2024-09-04T16:06:23Z)
- ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks [10.852047082856487]
We introduce ELP-adapter tuning, a novel method for parameter-efficient fine-tuning using three types of adapters.
E-adapters are integrated into transformer-based encoder layers and help to learn fine-grained speech representations that are effective for speech recognition.
L-adapters create paths from each encoder layer to the downstream head and help to extract non-linguistic features from lower encoder layers.
The P-adapter appends pseudo features to CNN features to further improve effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-28T05:26:03Z)
- Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models [12.230087530720652]
We introduce an adapter module that offers better efficiency in large-scale multi-task adaptation scenarios.
The adapter consists of a single shared controller network and multiple task-level adapter heads.
arXiv Detail & Related papers (2024-03-25T17:21:56Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically found that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters [96.52807311742198]
We re-examine the parameter-efficiency of Adapters through the lens of network pruning.
We find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80%.
arXiv Detail & Related papers (2022-10-09T15:28:48Z)
- AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z)
- Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition [0.1909808926064466]
Transformer based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain.
We propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks.
arXiv Detail & Related papers (2022-02-07T14:20:54Z)
- On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation [36.37565646597464]
Adapter-based tuning works by adding light-weight adapter modules to a pretrained language model (PrLM); a generic sketch of this recipe appears after this list.
It adds only a few trainable parameters per new task, allowing a high degree of parameter sharing.
We demonstrate that adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks.
arXiv Detail & Related papers (2021-06-06T16:10:12Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
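Several of the papers above share the same basic adapter-tuning recipe: insert small bottleneck modules into a frozen pre-trained network and train only those modules. The sketch below is a minimal, generic PyTorch version of that recipe; the class names, GELU activation, bottleneck width, and post-layer placement are illustrative assumptions rather than the exact configuration of any single paper listed here.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project -> GELU -> up-project, plus a residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Near-zero init of the up-projection keeps the adapted layer close to
        # its original behavior at the start of training.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))


class AdaptedLayer(nn.Module):
    """Wraps a frozen Transformer layer and applies an adapter to its output.

    Assumes the wrapped layer returns a single tensor of shape (..., d_model).
    """

    def __init__(self, layer: nn.Module, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.layer = layer
        # Freeze the pre-trained layer; only the adapter is trained.
        for p in self.layer.parameters():
            p.requires_grad = False
        self.adapter = BottleneckAdapter(d_model, bottleneck)

    def forward(self, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        return self.adapter(self.layer(x, *args, **kwargs))
```

Only the adapter parameters (plus, typically, layer norms and the downstream head) are passed to the optimizer, which keeps per-task storage small and makes it cheap to maintain one adapter set per task or language.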