Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2202.03218v1
- Date: Mon, 7 Feb 2022 14:20:54 GMT
- Title: Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
- Authors: Bethan Thomas, Samuel Kessler, Salah Karout
- Abstract summary: Transformer based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain.
We propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks.
- Score: 0.1909808926064466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) is a powerful tool that allows learning of
underlying representations from unlabeled data. Transformer based models such
as wav2vec 2.0 and HuBERT are leading the field in the speech domain. Generally
these models are fine-tuned on a small amount of labeled data for a downstream
task such as Automatic Speech Recognition (ASR). This involves re-training the
majority of the model for each task. Adapters are small lightweight modules
which are commonly used in Natural Language Processing (NLP) to adapt
pre-trained models to new tasks. In this paper we propose applying adapters to
wav2vec 2.0 to reduce the number of parameters required for downstream ASR
tasks, and increase scalability of the model to multiple tasks or languages.
Using adapters we can perform ASR while training fewer than 10% of parameters
per task compared to full fine-tuning with little degradation of performance.
Ablations show that applying adapters into just the top few layers of the
pre-trained network gives similar performance to full transfer, supporting the
theory that higher pre-trained layers encode more phonemic information, and
further optimizing efficiency.
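The abstract does not detail the adapter design, but the standard NLP recipe it references is a small bottleneck module (down-projection, nonlinearity, up-projection, residual connection) inserted into each transformer block of a frozen encoder. Below is a minimal, self-contained PyTorch sketch under that assumption, using generic transformer layers as a stand-in for a wav2vec 2.0 BASE-sized encoder; the `BottleneckAdapter` module, the bottleneck width of 256, and the choice to adapt only the top `k` layers (mirroring the ablation finding) are illustrative, not the paper's exact configuration.

```python
# Minimal sketch: bottleneck adapters on top of a frozen transformer encoder.
# Assumptions: Houlsby-style down/up projection with a residual connection,
# inserted only into the top-k layers; dimensions are illustrative.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added back residually."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(self.norm(x))))


class AdaptedLayer(nn.Module):
    """Runs a frozen pre-trained block, then a trainable adapter on its output."""

    def __init__(self, layer: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.layer = layer              # frozen pre-trained block
        self.adapter = BottleneckAdapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


# Stand-in for a wav2vec 2.0 BASE-sized encoder: 12 blocks, hidden size 768.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
     for _ in range(12)]
)
for p in layers.parameters():           # freeze the whole pre-trained stack
    p.requires_grad = False

k = 4                                   # adapt only the top-k layers
for i in range(len(layers) - k, len(layers)):
    layers[i] = AdaptedLayer(layers[i])

x = torch.randn(2, 50, 768)             # (batch, frames, hidden) features
for layer in layers:
    x = layer(x)
print(x.shape)                          # torch.Size([2, 50, 768])
```

Because the pre-trained stack is frozen before wrapping, only the adapter weights (plus, in practice, a small task head such as a CTC projection) receive gradients, which is where the fewer-than-10%-of-parameters figure comes from.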
Related papers
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding [40.27182770995891]
Fine-tuning is widely used as the default algorithm for transfer learning from pre-trained models.
We introduce the Speech UndeRstanding Evaluation (SURE) benchmark for parameter-efficient learning for various speech-processing tasks.
arXiv Detail & Related papers (2023-03-02T08:57:33Z)
- UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling [49.134517040512414]
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms the state-of-the-arts, but even beats the full fine-tuning strategy.
arXiv Detail & Related papers (2023-02-13T18:59:10Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Exploring Efficient-tuning Methods in Self-supervised Speech Models [53.633222197712875]
Self-supervised learning can learn powerful representations for different speech tasks.
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained.
We show that performance parity can be achieved with over 90% parameter reduction (a minimal sketch of this freeze-and-adapt pattern follows the list below).
arXiv Detail & Related papers (2022-10-10T11:08:12Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models to downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also causes instability during few-shot task adaptation.
We introduce a new mechanism to improve adapter capacity without increasing parameters or computational cost by two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks [55.705355299065474]
Transformer-based pre-trained models with millions of parameters require large storage.
Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters.
In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed.
arXiv Detail & Related papers (2022-04-30T16:49:41Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
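A pattern shared across these related works (and the main paper above) is to freeze the pre-trained backbone and update only the small task-specific modules, which is what produces the "fewer than 10% of parameters per task" and "over 90% parameter reduction" figures. The helper below is a hypothetical utility, not code from any of the cited papers: it freezes every parameter whose name lacks the substring "adapter", reports the trainable share, and would apply directly to the adapted encoder sketched earlier.

```python
# Hypothetical helper: freeze everything except adapter parameters and
# report what fraction of the model remains trainable.
import torch.nn as nn


def freeze_except_adapters(model: nn.Module, keyword: str = "adapter") -> float:
    """Freeze every parameter whose name lacks `keyword`; return the trainable fraction."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name   # only adapter weights stay trainable
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable / total


# Usage with the adapted encoder from the sketch above:
# frac = freeze_except_adapters(layers)
# print(f"trainable parameters: {100 * frac:.1f}%")
```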