Front-End Adapter: Adapting Front-End Input of Speech based
Self-Supervised Learning for Speech Recognition
- URL: http://arxiv.org/abs/2302.09331v1
- Date: Sat, 18 Feb 2023 13:46:12 GMT
- Title: Front-End Adapter: Adapting Front-End Input of Speech based
Self-Supervised Learning for Speech Recognition
- Authors: Xie Chen, Ziyang Ma, Changli Tang, Yujin Wang, Zhisheng Zheng
- Abstract summary: Speech based SSL models present promising performance in a range of speech related tasks.
It is essential to use consistent front-end input during pre-training and fine-tuning.
We propose a simple but effective front-end adapter to address this front-end discrepancy.
- Score: 6.238268985570237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed a boom in self-supervised learning (SSL) in
various areas including speech processing. Speech based SSL models present
promising performance in a range of speech related tasks. However, the training
of SSL models is computationally expensive and a common practice is to
fine-tune a released SSL model on the specific task. It is essential to use
consistent front-end input during pre-training and fine-tuning. This
consistency may introduce potential issues when the optimal front-end is not
the same as that used in pre-training. In this paper, we propose a simple but
effective front-end adapter to address this front-end discrepancy. By
minimizing the distance between the outputs of different front-ends, the
filterbank feature (Fbank) can be compatible with SSL models which are
pre-trained with waveform. The experiment results demonstrate the effectiveness
of our proposed front-end adapter on several popular SSL models for the speech
recognition task.
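Below is a minimal, hypothetical PyTorch sketch of the idea described in the abstract, not the authors' released code: a small adapter maps Fbank features toward the output space of the waveform front-end, and is trained by minimizing a distance (L1 here) between the two front-end outputs. The module name FbankAdapter, the feature dimensions, and the choice of L1 distance are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): train a small adapter that maps
# Fbank features into the space produced by a waveform front-end, by
# minimizing a distance (L1 here) between the two front-end outputs.
import torch
import torch.nn as nn

class FbankAdapter(nn.Module):
    """Hypothetical adapter: projects 80-dim Fbank frames to the SSL front-end
    dimension and smooths them with a small convolutional stack."""
    def __init__(self, fbank_dim=80, ssl_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(fbank_dim, ssl_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(ssl_dim, ssl_dim, kernel_size=3, padding=1),
        )

    def forward(self, fbank):                    # fbank: (batch, time, fbank_dim)
        x = fbank.transpose(1, 2)                # -> (batch, fbank_dim, time)
        return self.net(x).transpose(1, 2)       # -> (batch, time, ssl_dim)

def adapter_loss(adapter, fbank, waveform_features):
    """Distance between adapted Fbank and the (frozen) waveform front-end output;
    both are assumed to be aligned to the same frame rate."""
    return nn.functional.l1_loss(adapter(fbank), waveform_features)

# Toy usage: random tensors stand in for real Fbank frames and for the output
# of a frozen wav2vec 2.0-style convolutional feature extractor.
adapter = FbankAdapter()
fbank = torch.randn(4, 100, 80)          # (batch, frames, mel bins)
wave_feats = torch.randn(4, 100, 512)    # frozen waveform front-end output
loss = adapter_loss(adapter, fbank, wave_feats)
loss.backward()
```

In this reading of the abstract, once the adapter is trained, Fbank features passed through it can stand in for the waveform front-end output as input to the pre-trained SSL encoder.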
Related papers
- Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically found that adding a CNN adapter to the feature extractor can help adaptation on emotion and speaker tasks.
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Exploring Efficient-tuning Methods in Self-supervised Speech Models [53.633222197712875]
Self-supervised learning can learn powerful representations for different speech tasks.
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained (see the sketch after this list).
We show that performance parity can be achieved with over 90% parameter reduction.
arXiv Detail & Related papers (2022-10-10T11:08:12Z)
- Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio [19.865050806327147]
Self-supervised learning has proven vital in speech and audio-related applications.
This paper provides the first empirical study of SSL pre-training for different specified sequence lengths.
We find that training on short sequences can dramatically reduce resource costs while retaining a satisfactory performance for all tasks.
arXiv Detail & Related papers (2022-09-30T16:35:42Z)
- Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech [32.058337220323146]
Self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications.
This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings.
arXiv Detail & Related papers (2022-08-10T16:56:39Z)
- Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
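Several of the entries above (the CNN adapter and efficient-tuning papers in particular) rely on the same general recipe: freeze the pre-trained SSL encoder and train only small adapter modules. The following is a hypothetical PyTorch sketch of that recipe, not code from any of the listed papers; the residual bottleneck design, dimensions, and optimizer settings are assumptions.

```python
# Illustrative adapter-tuning sketch (assumed design, not from the listed papers):
# freeze a pre-trained encoder stack and train only small bottleneck adapters.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Hypothetical residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedEncoder(nn.Module):
    """Wraps a frozen layer stack with one trainable adapter per layer."""
    def __init__(self, layers):
        super().__init__()
        self.layers = layers
        self.adapters = nn.ModuleList(BottleneckAdapter() for _ in layers)
        for p in self.layers.parameters():   # freeze the pre-trained weights
            p.requires_grad = False

    def forward(self, x):
        for layer, adapter in zip(self.layers, self.adapters):
            x = adapter(layer(x))
        return x

# Toy stand-in for a pre-trained SSL encoder: a stack of linear layers.
pretrained = nn.ModuleList(nn.Linear(768, 768) for _ in range(4))
model = AdaptedEncoder(pretrained)

# Only the adapter parameters reach the optimizer; the encoder stays frozen.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
out = model(torch.randn(2, 50, 768))   # (batch, frames, hidden)
```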