Online Continual Learning of End-to-End Speech Recognition Models
- URL: http://arxiv.org/abs/2207.05071v1
- Date: Mon, 11 Jul 2022 05:35:06 GMT
- Title: Online Continual Learning of End-to-End Speech Recognition Models
- Authors: Muqiao Yang, Ian Lane, Shinji Watanabe
- Abstract summary: Continual Learning aims to continually learn from new data as it becomes available.
We show that with online continual learning and a selective sampling strategy, we can maintain an accuracy similar to retraining a model from scratch.
- Score: 29.931427687979532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual Learning, also known as Lifelong Learning, aims to continually
learn from new data as it becomes available. While prior research on continual
learning in automatic speech recognition has focused on the adaptation of
models across multiple different speech recognition tasks, in this paper we
propose an experimental setting for online continual learning for
automatic speech recognition of a single task. Specifically focusing on the
case where additional training data for the same task becomes available
incrementally over time, we demonstrate the effectiveness of performing
incremental model updates to end-to-end speech recognition models with an
online Gradient Episodic Memory (GEM) method. Moreover, we show that with
online continual learning and a selective sampling strategy, we can maintain an
accuracy that is similar to retraining a model from scratch while requiring
significantly lower computation costs. We have also verified our method with
self-supervised learning (SSL) features.
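The abstract names Gradient Episodic Memory (GEM) but does not spell out the update rule. As background, GEM keeps a small memory of past examples and adjusts each new gradient so that it does not increase the loss on that memory. The sketch below shows only the single-constraint closed-form projection (the form used by the averaged variant, A-GEM); whether the paper's online GEM uses exactly this form is an assumption, and the names `dot` and `project_gradient` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_gradient(g, g_mem):
    """Single-constraint GEM-style projection (illustrative assumption).

    g     : gradient computed on the newly arrived batch
    g_mem : gradient computed on examples drawn from the episodic memory

    If the two gradients do not conflict (non-negative inner product),
    g is returned unchanged. Otherwise the component of g that points
    against g_mem is removed, so the returned step is orthogonal to g_mem.
    """
    inner = dot(g, g_mem)
    if inner >= 0.0:
        return list(g)
    scale = inner / dot(g_mem, g_mem)
    return [a - scale * b for a, b in zip(g, g_mem)]


# Conflicting case: the new-data gradient points against the memory gradient.
projected = project_gradient([1.0, -2.0], [0.0, 1.0])

# Non-conflicting case: the gradient passes through untouched.
unchanged = project_gradient([1.0, 1.0], [1.0, 0.0])
```

In a real training loop the vectors would be the flattened parameter gradients of the speech recognition model, and the projected gradient would replace the raw one before the optimizer step.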
Related papers
- Improving Pretrained YAMNet for Enhanced Speech Command Detection via Transfer Learning [0.23408308015481666]
We adapt and train a YAMNet deep learning model to effectively detect and interpret speech commands from audio signals.
The final model achieved a recognition accuracy of 95.28%, underscoring the impact of advanced machine learning techniques.
arXiv Detail & Related papers (2025-04-26T21:57:11Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z) - CSSL-MHTR: Continual Self-Supervised Learning for Scalable Multi-script Handwritten Text Recognition [16.987008461171065]
We explore the potential of continual self-supervised learning to alleviate the catastrophic forgetting problem in handwritten text recognition.
Our method consists in adding intermediate layers called adapters for each task, and efficiently distilling knowledge from the previous model while learning the current task.
We attain state-of-the-art performance on English, Italian and Russian scripts, whilst adding only a few parameters per task.
arXiv Detail & Related papers (2023-03-16T14:27:45Z) - ILASR: Privacy-Preserving Incremental Learning for Automatic Speech
Recognition at Production Scale [19.524894956258343]
This paper uses a cloud-based framework for production systems to demonstrate insights from privacy-preserving incremental learning for automatic speech recognition (ILASR).
We show that the proposed system can improve the production models significantly (3%) over a new time period of six months, even in the absence of human-annotated labels.
arXiv Detail & Related papers (2022-07-19T05:24:13Z) - Lip-Listening: Mixing Senses to Understand Lips using Cross Modality
Knowledge Distillation for Word-Based Models [0.03499870393443267]
This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training.
arXiv Detail & Related papers (2022-06-05T15:47:54Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - UniSpeech-SAT: Universal Speech Representation Learning with Speaker
Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z) - Online Continual Learning with Natural Distribution Shifts: An Empirical
Study with Visual Data [101.6195176510611]
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z) - Continual-wav2vec2: an Application of Continual Learning for
Self-Supervised Automatic Speech Recognition [0.23872611575805824]
We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL).
Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data.
We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining a new language task.
arXiv Detail & Related papers (2021-07-26T10:39:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.