Replay to Remember: Continual Layer-Specific Fine-tuning for German
Speech Recognition
- URL: http://arxiv.org/abs/2307.07280v2
- Date: Wed, 18 Oct 2023 10:36:36 GMT
- Title: Replay to Remember: Continual Layer-Specific Fine-tuning for German
Speech Recognition
- Authors: Theresa Pekarek Rosin, Stefan Wermter
- Abstract summary: We study how well the performance of large-scale ASR models can be approximated for smaller domains.
We apply Experience Replay for continual learning to increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain.
- Score: 19.635428830237842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Automatic Speech Recognition (ASR) models have shown significant
advances with the introduction of unsupervised or self-supervised training
techniques, these improvements are still only limited to a subsection of
languages and speakers. Transfer learning enables the adaptation of large-scale
multilingual models to not only low-resource languages but also to more
specific speaker groups. However, fine-tuning on data from new domains is
usually accompanied by a decrease in performance on the original domain.
Therefore, in our experiments, we examine how well the performance of
large-scale ASR models can be approximated for smaller domains, with our own
dataset of German Senior Voice Commands (SVC-de), and how much of the general
speech recognition performance can be preserved by selectively freezing parts
of the model during training. To further increase the robustness of the ASR
model to vocabulary and speakers outside of the fine-tuned domain, we apply
Experience Replay for continual learning. By adding only a fraction of data
from the original domain, we are able to reach Word-Error-Rates (WERs) below
5% on the new domain, while stabilizing performance for general speech
recognition at acceptable WERs.
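The abstract describes mixing a fraction of original-domain data into fine-tuning (Experience Replay) and reporting Word-Error-Rate. The following is only a minimal sketch of those two ideas, not the authors' implementation. The function names, the sample format, and the choice to define the replay fraction relative to the size of the new-domain set are all assumptions made for illustration.

```python
import random


def build_replay_mix(new_domain, original_domain, replay_fraction=0.1, seed=0):
    """Return a shuffled training list: every new-domain sample plus a
    random subset of original-domain samples (the replay buffer).

    Assumption: replay_fraction is measured relative to len(new_domain);
    the source does not specify how the fraction is defined.
    """
    rng = random.Random(seed)
    n_replay = min(len(original_domain), round(replay_fraction * len(new_domain)))
    replay = rng.sample(original_domain, n_replay)
    mixed = list(new_domain) + replay
    rng.shuffle(mixed)
    return mixed


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)


# Hypothetical toy data: (domain tag, utterance id) pairs.
new_samples = [("new", i) for i in range(100)]
original_samples = [("orig", i) for i in range(1000)]
mixed_samples = build_replay_mix(new_samples, original_samples, replay_fraction=0.1)

# One substituted word out of three reference words.
example_wer = word_error_rate("licht an bitte", "licht aus bitte")
```

In a real setup the mixed list would feed a fine-tuning loop in which selected layers are frozen, and WER would be computed separately on the new-domain and original-domain test sets to track both adaptation and forgetting.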
Related papers
- SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition [3.4355593397388597]
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models.
We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models.
We find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER.
arXiv Detail & Related papers (2024-08-14T23:33:10Z)
- Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated better performance than supervised models on tasks including phoneme recognition.
Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available.
We propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z)
- Improving Accented Speech Recognition with Multi-Domain Training [2.28438857884398]
We use speech audio representing four different French accents to create fine-tuning datasets that improve the robustness of pre-trained ASR models.
Our numerical experiments show that we can reduce error rates by up to 25% (relative) on African and Belgian accents.
arXiv Detail & Related papers (2023-03-14T14:10:16Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Visual Speech Recognition for Multiple Languages in the Wild [64.52593130370757]
We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
arXiv Detail & Related papers (2022-02-26T07:21:00Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech Recognition in Real-World Applications -- A Case Study on German Oral History Interviews [21.47857960919014]
We propose an approach that performs a robust acoustic model adaption to a target domain in a cross-lingual, multi-staged manner.
Our approach enables the exploitation of large-scale training data from other domains in both the same and other languages.
arXiv Detail & Related papers (2020-05-26T08:05:25Z)
- Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)