Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2202.03218v1
- Date: Mon, 7 Feb 2022 14:20:54 GMT
- Title: Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
- Authors: Bethan Thomas, Samuel Kessler, Salah Karout
- Abstract summary: Transformer based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain.
We propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks.
- Score: 0.1909808926064466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) is a powerful tool that allows learning of
underlying representations from unlabeled data. Transformer based models such
as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally
these models are fine-tuned on a small amount of labeled data for a downstream
task such as Automatic Speech Recognition (ASR). This involves re-training the
majority of the model for each task. Adapters are small lightweight modules
which are commonly used in Natural Language Processing (NLP) to adapt
pre-trained models to new tasks. In this paper we propose applying adapters to
wav2vec 2.0 to reduce the number of parameters required for downstream ASR
tasks, and increase scalability of the model to multiple tasks or languages.
Using adapters we can perform ASR while training fewer than 10% of parameters
per task compared to full fine-tuning with little degradation of performance.
Ablations show that applying adapters into just the top few layers of the
pre-trained network gives similar performance to full transfer, supporting the
theory that higher pre-trained layers encode more phonemic information, and
further optimizing efficiency.
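The abstract does not detail the adapter design, but the standard NLP recipe it references is a small bottleneck module (down-projection, nonlinearity, up-projection, residual connection) inserted into each transformer block of a frozen encoder. Below is a minimal, self-contained PyTorch sketch under that assumption, using generic transformer layers as a stand-in for a wav2vec 2.0 BASE-sized encoder; the `BottleneckAdapter` module, the bottleneck width of 256, and the choice to adapt only the top `k` layers (mirroring the ablation finding) are illustrative, not the paper's exact configuration.

```python
# Minimal sketch: bottleneck adapters on top of a frozen transformer encoder.
# Assumptions: Houlsby-style down/up projection with a residual connection,
# inserted only into the top-k layers; dimensions are illustrative.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added back residually."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(self.norm(x))))


class AdaptedLayer(nn.Module):
    """Runs a frozen pre-trained block, then a trainable adapter on its output."""

    def __init__(self, layer: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.layer = layer              # frozen pre-trained block
        self.adapter = BottleneckAdapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


# Stand-in for a wav2vec 2.0 BASE-sized encoder: 12 blocks, hidden size 768.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
     for _ in range(12)]
)
for p in layers.parameters():           # freeze the whole pre-trained stack
    p.requires_grad = False

k = 4                                   # adapt only the top-k layers
for i in range(len(layers) - k, len(layers)):
    layers[i] = AdaptedLayer(layers[i])

x = torch.randn(2, 50, 768)             # (batch, frames, hidden) features
for layer in layers:
    x = layer(x)
print(x.shape)                          # torch.Size([2, 50, 768])
```

Because the pre-trained stack is frozen before wrapping, only the adapter weights (plus, in practice, a small task head such as a CTC projection) receive gradients, which is where the fewer-than-10%-of-parameters figure comes from.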
Related papers
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding [40.27182770995891]
Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models.
We introduce the Speech UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning for various speech-processing tasks.
arXiv Detail & Related papers (2023-03-02T08:57:33Z)
- UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling [49.134517040512414]
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy.
arXiv Detail & Related papers (2023-02-13T18:59:10Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Exploring Efficient-tuning Methods in Self-supervised Speech Models [53.633222197712875]
Self-supervised learning can learn powerful representations for different speech tasks.
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained.
We show that performance parity can be achieved with over 90% parameter reduction (a minimal sketch of this freeze-and-adapt pattern follows the list below).
arXiv Detail & Related papers (2022-10-10T11:08:12Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also causes instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
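A pattern shared across these related works (and the main paper above) is to freeze the pre-trained backbone and update only the small task-specific modules, which is what produces the "fewer than 10% of parameters per task" and "over 90% parameter reduction" figures. The helper below is a hypothetical utility, not code from any of the cited papers: it freezes every parameter whose name lacks the substring "adapter", reports the trainable share, and would apply directly to the adapted encoder sketched earlier.

```python
# Hypothetical helper: freeze everything except adapter parameters and
# report what fraction of the model remains trainable.
import torch.nn as nn


def freeze_except_adapters(model: nn.Module, keyword: str = "adapter") -> float:
    """Freeze every parameter whose name lacks `keyword`; return the trainable fraction."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name   # only adapter weights stay trainable
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total


# Usage with the adapted encoder from the sketch above:
# frac = freeze_except_adapters(layers)
# print(f"trainable parameters: {100 * frac:.1f}%")
```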