Data augmentation using prosody and false starts to recognize non-native children's speech
- URL: http://arxiv.org/abs/2008.12914v1
- Date: Sat, 29 Aug 2020 05:32:32 GMT
- Title: Data augmentation using prosody and false starts to recognize non-native children's speech
- Authors: Hemant Kathania, Mittul Singh, Tamás Grósz, Mikko Kurimo
- Abstract summary: This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
- Score: 12.911954427107977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes AaltoASR's speech recognition system for the INTERSPEECH
2020 shared task on Automatic Speech Recognition (ASR) for non-native
children's speech. The task is to recognize non-native speech from children of
various age groups given a limited amount of speech. Moreover, the speech being
spontaneous has false starts transcribed as partial words, which in the test
transcriptions leads to unseen partial words. To cope with these two
challenges, we investigate a data augmentation-based approach. Firstly, we
apply the prosody-based data augmentation to supplement the audio data.
Secondly, we simulate false starts by introducing partial-word noise in the
language modeling corpora creating new words. Acoustic models trained on
prosody-based augmented data outperform the models using the baseline recipe or
the SpecAugment-based augmentation. The partial-word noise also helps to
improve the baseline language model. Our ASR system, a combination of these
schemes, is placed third in the evaluation period and achieves the word error
rate of 18.71%. Post-evaluation period, we observe that increasing the amounts
of prosody-based augmented data leads to better performance. Furthermore,
removing low-confidence-score words from hypotheses can lead to further gains.
These two improvements lower the ASR error rate to 17.99%.
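The two text-side ideas above, injecting partial-word noise into the language-model corpus and dropping low-confidence words from hypotheses, can be sketched in a few lines. This is a minimal illustration only: the function names, the noise probability, and the confidence threshold are assumptions, not the paper's actual recipe.

```python
import random


def add_partial_word_noise(sentence, p=0.1, min_prefix=2, rng=None):
    """Insert truncated copies of words to mimic transcribed false starts.

    p and min_prefix are illustrative values, not the paper's settings.
    """
    rng = rng or random.Random(0)
    noisy = []
    for word in sentence.split():
        if len(word) > min_prefix and rng.random() < p:
            # keep a random prefix of the word, e.g. "rec-" before "recognize"
            cut = rng.randint(min_prefix, len(word) - 1)
            noisy.append(word[:cut] + "-")
        noisy.append(word)
    return " ".join(noisy)


def drop_low_confidence(hypothesis, threshold=0.3):
    """Remove words whose recognizer confidence falls below a threshold.

    hypothesis is a list of (word, confidence) pairs; the threshold is
    an assumed example value.
    """
    return [word for word, conf in hypothesis if conf >= threshold]
```

Running the noise function over a language-modeling corpus creates new partial-word tokens (e.g. "rec-"), so the language model is no longer surprised by the false starts that appear in spontaneous children's speech.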
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- End-to-end speech recognition modeling from de-identified data [1.3400866200396329]
De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, but removing identifying content degrades recognition accuracy.
We propose and evaluate a two-step method for partially recovering this loss.
We evaluate the performance of this method on in-house data of medical conversations.
arXiv Detail & Related papers (2022-07-12T11:29:52Z)
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
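Speed perturbation of this kind can be sketched with simple linear-interpolation resampling. The function below is a pure-Python illustration under assumed conventions (a list of float samples, a user-chosen factor); production recipes typically use high-quality resamplers with factors such as 0.9 and 1.1.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so it plays `factor` times faster.

    Linear interpolation over a list of float samples; speeding up by
    `factor` shortens the signal (and raises pitch) accordingly.
    """
    n_out = int(round(len(samples) / factor))
    out = []
    for i in range(n_out):
        # fractional position of this output sample in the input signal
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Applying the same utterance at several factors multiplies the effective training data while shifting speaker characteristics, which is why speed perturbation is a common baseline augmentation.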
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Influence of ASR and Language Model on Alzheimer's Disease Detection [2.4698886064068555]
We analyse the use of a state-of-the-art ASR system to transcribe participants' spoken descriptions of a picture.
We study the influence of a language model, which tends to correct non-standard word sequences, by comparing ASR decoding with and without one.
The proposed system combines acoustic -- based on prosody and voice quality -- and lexical features based on the first occurrence of the most common words.
arXiv Detail & Related papers (2021-09-20T10:41:39Z)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System [19.435571932141364]
This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data.
arXiv Detail & Related papers (2021-06-18T07:36:26Z)
- On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era [0.0]
We create transcripts from the original speech by applying three modern ASR systems.
For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep.
We achieve state-of-the-art unweighted average recall values of 73.6% and 73.8% on the speaker-independent development and test partitions of IEMOCAP.
arXiv Detail & Related papers (2021-04-20T17:10:01Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
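The BPE-dropout idea mentioned in the last entry can be illustrated with a toy encoder that randomly skips merge operations, so the same word yields varied subword segmentations during training. This is a hedged sketch under assumptions: the merge table and probability are made up, and a real tokenizer applies merges in learned priority order rather than greedily.

```python
import random


def bpe_dropout_encode(word, merges, p=0.1, rng=None):
    """Greedily apply BPE merges, skipping each candidate merge with prob p.

    merges is a set of (left, right) token pairs; with p=0 this is plain
    greedy BPE, with p=1 the word stays split into characters.
    """
    rng = rng or random.Random(0)
    tokens = list(word)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges and rng.random() >= p:
                # merge the adjacent pair into a single subword token
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
                merged = True
                break
    return tokens
```

Because dropped merges leave rarer, shorter units in the training data, the model sees more diverse acoustic-unit sequences, which is the effect the paper exploits for low-resource, high-OOV setups.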
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.