On-Device Personalization of Automatic Speech Recognition Models for
Disordered Speech
- URL: http://arxiv.org/abs/2106.10259v1
- Date: Fri, 18 Jun 2021 17:48:08 GMT
- Title: On-Device Personalization of Automatic Speech Recognition Models for
Disordered Speech
- Authors: Katrin Tomanek, Françoise Beaufays, Julie Cattiau, Angad
Chandorkar, Khe Chai Sim
- Abstract summary: We present an approach to on-device ASR personalization with very small amounts of speaker-specific data.
We test our approach on a diverse set of 100 speakers with disordered speech and find a median relative word error rate improvement of 71% with only 50 short utterances required per speaker.
- Score: 9.698986579582236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While current state-of-the-art Automatic Speech Recognition (ASR) systems
achieve high accuracy on typical speech, they suffer from significant
performance degradation on disordered speech and other atypical speech
patterns. Personalization of ASR models, a commonly applied solution to this
problem, is usually performed in a server-based training environment posing
problems around data privacy, delayed model-update times, and communication
cost for copying data and models between mobile device and server
infrastructure. In this paper, we present an approach to on-device ASR
personalization with very small amounts of speaker-specific data. We test our
approach on a diverse set of 100 speakers with disordered speech and find a
median relative word error rate improvement of 71% with only 50 short
utterances required per speaker. When tested on a voice-controlled home
automation platform, on-device personalized models show a median task success
rate of 81%, compared to only 40% for the unadapted models.
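The paper itself does not include code, but as a rough illustration, here is a minimal sketch of the kind of on-device fine-tuning loop the abstract describes, written in PyTorch. The `model`, its `decoder` submodule, and the `(audio, transcript)` interface are assumptions made for the sketch; the authors' actual on-device setup may differ.

```python
import torch

def personalize_on_device(model, utterances, steps=100, lr=1e-4):
    """Fine-tune an ASR model on a small set of speaker-specific utterances.

    `model` is assumed to return a scalar training loss given a batch;
    this mirrors the abstract's setup (~50 short utterances per speaker)
    but is NOT the authors' implementation.
    """
    # Freeze everything, then unfreeze a small subset of layers to keep
    # on-device memory and compute low (a common personalization choice).
    for p in model.parameters():
        p.requires_grad = False
    for p in model.decoder.parameters():  # hypothetical submodule name
        p.requires_grad = True

    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for step in range(steps):
        audio, transcript = utterances[step % len(utterances)]
        loss = model(audio, transcript)  # assumed to return a training loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```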
Related papers
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two multilingual benchmarks, with word error rates 3.3% and 2.0% lower, respectively.
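The paper's exact scoring features aren't listed in this summary; below is a generic re-ranking sketch that combines the ASR score with hypothetical external language-model and language-identification scores.

```python
def rerank_nbest(nbest, lm_logprob, lid_logprob, alpha=0.5, beta=0.3):
    """Pick the best hypothesis from an N-best list by combined score.

    nbest: list of (text, asr_logprob) pairs from the ASR decoder.
    lm_logprob / lid_logprob: hypothetical external scorers; the paper's
    actual re-ranking features may differ.
    """
    def score(hyp):
        text, asr_lp = hyp
        return asr_lp + alpha * lm_logprob(text) + beta * lid_logprob(text)
    return max(nbest, key=score)
```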
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection [46.855958156126164]
This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset.
arXiv Detail & Related papers (2024-06-11T13:35:50Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Robust Speech Recognition via Large-Scale Weak Supervision [69.63329359286419]
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks.
We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
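Since the models and inference code are public, transcription can be run in a few lines with the released openai-whisper package; a usage illustration, with "audio.wav" as a placeholder file:

```python
# Requires: pip install openai-whisper (the inference package released
# by the authors at github.com/openai/whisper)
import whisper

model = whisper.load_model("base")      # downloads a small checkpoint
result = model.transcribe("audio.wav")  # any local audio file
print(result["text"])
```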
arXiv Detail & Related papers (2022-12-06T18:46:04Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT), which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
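As a rough sketch of the t-SOT idea mentioned above (not the authors' code): tokens from all speakers are serialized in emission-time order, with a special channel-change token inserted whenever the virtual output channel switches. The token name and two-channel setup below follow the common t-SOT description; details may differ from the paper.

```python
CC = "<cc>"  # channel-change token used by t-SOT (2 virtual channels)

def serialize_tsot(words):
    """Serialize multi-talker word streams into one t-SOT token stream.

    words: list of (emit_time, channel, token) tuples across all speakers.
    Tokens are emitted in chronological order; a <cc> token is inserted
    whenever the virtual output channel switches.
    """
    out, prev_channel = [], None
    for _, channel, token in sorted(words):
        if prev_channel is not None and channel != prev_channel:
            out.append(CC)
        out.append(token)
        prev_channel = channel
    return out

# Overlapping "hello there" (channel 0) and "hi" (channel 1):
print(serialize_tsot([(0.0, 0, "hello"), (0.4, 1, "hi"), (0.5, 0, "there")]))
# ['hello', '<cc>', 'hi', '<cc>', 'there']
```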
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Nonverbal Sound Detection for Disordered Speech [24.636175845214822]
We introduce an alternative voice-based input system which relies on sound event detection using fifteen non-verbal mouth sounds.
This system was designed to work regardless of one's speech abilities and allows full access to existing technology.
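As a loose illustration only (the paper's architecture and feature pipeline are not described in this summary), a fifteen-way sound-event classifier over log-mel frames might look like this in PyTorch:

```python
import torch
import torch.nn as nn

NUM_SOUNDS = 15  # the fifteen non-verbal mouth sounds from the summary

# Hypothetical frame-level sound-event classifier; the paper's actual
# model and input features may differ.
classifier = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=64, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),    # pool over time
    nn.Flatten(),
    nn.Linear(64, NUM_SOUNDS),  # one logit per mouth sound
)

logmel = torch.randn(1, 40, 100)  # (batch, mel bins, frames)
probs = classifier(logmel).softmax(dim=-1)
print(probs.argmax(dim=-1))       # predicted sound-event class
```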
arXiv Detail & Related papers (2022-02-15T22:02:58Z)
- Robust Self-Supervised Audio-Visual Speech Recognition [29.526786921769613]
We present a self-supervised audio-visual speech recognition framework built upon Audio-Visual HuBERT (AV-HuBERT).
On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by 50% (28.0% vs. 14.1%) using less than 10% of labeled data.
Our approach reduces the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
arXiv Detail & Related papers (2022-01-05T18:50:50Z)
- Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets [0.0]
We trained personalized models for 195 individuals with different types and severities of speech impairment.
For the home automation scenario, 79% of speakers reached the target WER with 18-20 minutes of speech, and even with only 3-4 minutes of speech, 63% of speakers reached it.
arXiv Detail & Related papers (2021-10-09T17:11:17Z)
- Self-Supervised Learning for Personalized Speech Enhancement [25.05285328404576]
Speech enhancement systems can show improved performance by adapting the model towards a single test-time speaker.
A test-time user might provide only a small amount of noise-free speech data, likely insufficient for traditional fully-supervised learning.
We propose self-supervised methods designed specifically to learn personalized and discriminative features from abundant in-the-wild speech recordings that are noisy, but still personal.
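The exact objectives are in the paper; as one plausible reading of "personalized and discriminative features", here is a generic InfoNCE-style contrastive sketch in which segments from the target user's noisy recordings serve as positives and other speakers' as negatives:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic contrastive loss for speaker-discriminative features.

    anchor, positive: (dim,) embeddings of two segments from the target
    user's noisy recordings; negatives: (n, dim) embeddings from other
    speakers. An illustration only, not the paper's exact objective.
    """
    pos = F.cosine_similarity(anchor, positive, dim=0) / temperature
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=1) / temperature
    logits = torch.cat([pos.unsqueeze(0), neg])
    # The positive pair sits at index 0 of the logits.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```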
arXiv Detail & Related papers (2021-04-05T17:12:51Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
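One common way to apply BPE-dropout in practice is via SentencePiece's sampling mode, sketched below; the model file path is a placeholder, and this is not necessarily the authors' pipeline.

```python
import sentencepiece as spm

# "bpe.model" is a placeholder for a trained SentencePiece BPE model.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# With a BPE model, enable_sampling applies BPE-dropout: each merge is
# skipped with probability alpha, so the same transcript yields a
# different subword segmentation on each pass (the augmentation the
# summary refers to).
for _ in range(3):
    print(sp.encode("speech recognition", out_type=str,
                    enable_sampling=True, alpha=0.1))
```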
arXiv Detail & Related papers (2021-03-12T10:10:13Z)