Deploying self-supervised learning in the wild for hybrid automatic
speech recognition
- URL: http://arxiv.org/abs/2205.08598v1
- Date: Tue, 17 May 2022 19:37:40 GMT
- Title: Deploying self-supervised learning in the wild for hybrid automatic
speech recognition
- Authors: Mostafa Karimi, Changliang Liu, Kenichi Kumatani, Yao Qian, Tianyu Wu,
Jian Wu
- Abstract summary: Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
- Score: 20.03807843795386
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised learning (SSL) methods have proven to be very successful in
automatic speech recognition (ASR). These great improvements have been reported
mostly based on highly curated datasets such as LibriSpeech for non-streaming
End-to-End ASR models. However, the pivotal characteristic of SSL is that it
can be applied to any untranscribed audio data. In this paper, we provide a full
exploration of how to utilize uncurated audio data in SSL, from data
pre-processing to deploying a streaming hybrid ASR model. More specifically,
we present (1) the effect of an Audio Event Detection (AED) model in the data
pre-processing pipeline, (2) an analysis of the choice of optimizer and learning
rate scheduling, (3) a comparison of recently developed contrastive losses, and
(4) a comparison of various pre-training strategies, such as in-domain
versus out-domain pre-training data, monolingual versus multilingual
pre-training data, multi-head multilingual SSL versus single-head multilingual
SSL, and supervised pre-training versus SSL. The experimental results show that
SSL pre-training with in-domain uncurated data can achieve better performance
in comparison to all the alternative out-domain pre-training strategies.
Related papers
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z)
- Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification [19.893213508284813]
Self-supervised adaptive pre-training is proposed to adapt the pre-trained model to the target domain and languages of the downstream task.
We show that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages.
arXiv Detail & Related papers (2023-12-12T14:58:08Z)
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System [12.704529528199064]
Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning.
We study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system.
arXiv Detail & Related papers (2023-06-05T15:35:19Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for xSL named Cross-lingual Language Informative Span Masking (CLISM) to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of input parallel
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning [13.391307807956673]
We propose a novel automatic pronunciation assessment method based on self-supervised learning (SSL) models.
First, the proposed method fine-tunes the pre-trained SSL models with connectionist temporal classification to adapt the English pronunciation of English-as-a-second-language learners.
We show that the proposed SSL model-based methods outperform the baselines, in terms of the Pearson correlation coefficient, on datasets of Korean ESL learner children and Speechocean762.
arXiv Detail & Related papers (2022-04-08T06:13:55Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition [1.0705399532413615]
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition systems.
We study the effect of domain, language, dataset size, and other aspects of our upstream pre-training SSL data on the final performance of the low-resource downstream ASR task.
arXiv Detail & Related papers (2022-03-31T11:48:24Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)