Automatic Speech Recognition of Non-Native Child Speech for Language
Learning Applications
- URL: http://arxiv.org/abs/2306.16710v1
- Date: Thu, 29 Jun 2023 06:14:26 GMT
- Title: Automatic Speech Recognition of Non-Native Child Speech for Language
Learning Applications
- Authors: Simone Wills, Yu Bai, Cristian Tejedor-Garcia, Catia Cucchiarini,
Helmer Strik
- Abstract summary: We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
- Score: 18.849741353784328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voicebots have provided a new avenue for supporting the development of
language skills, particularly within the context of second language learning.
Voicebots, though, have largely been geared towards native adult speakers. We
sought to assess the performance of two state-of-the-art ASR systems,
Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can
support children acquiring a foreign language. We evaluated their performance
on read and extemporaneous speech of native and non-native Dutch children. We
also investigated the utility of using ASR technology to provide insight into
the children's pronunciation and fluency. The results show that recent,
pre-trained ASR transformer-based models achieve acceptable performance from
which detailed feedback on phoneme pronunciation quality can be extracted,
despite the challenging nature of child and non-native speech.
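As a concrete illustration of the evaluation described in the abstract, the sketch below transcribes a single Dutch recording with both model families and scores each hypothesis against a reference transcript. This is a minimal sketch only: the checkpoint names (facebook/wav2vec2-large-xlsr-53-dutch, openai/whisper-small), the audio file, and the reference text are assumptions for illustration; the paper does not state which checkpoints or toolkit were used.

```python
# Minimal sketch: compare Wav2Vec2 and Whisper on one Dutch utterance.
# Checkpoints, file name, and reference transcript are assumptions,
# not the paper's actual setup.
import torchaudio
from jiwer import wer
from transformers import pipeline

AUDIO_PATH = "child_utterance.wav"    # hypothetical recording
REFERENCE = "de kat zit op de mat"    # hypothetical reference transcript

# Both models expect 16 kHz mono audio.
waveform, sr = torchaudio.load(AUDIO_PATH)
waveform = torchaudio.functional.resample(waveform, sr, 16000)
audio = waveform.mean(dim=0).numpy()  # downmix to mono

# A CTC-based Wav2Vec2 model finetuned for Dutch, and a multilingual
# seq2seq Whisper model, both loaded as off-the-shelf ASR pipelines.
wav2vec2_asr = pipeline("automatic-speech-recognition",
                        model="facebook/wav2vec2-large-xlsr-53-dutch")
whisper_asr = pipeline("automatic-speech-recognition",
                       model="openai/whisper-small")

for name, asr in [("Wav2Vec2", wav2vec2_asr), ("Whisper", whisper_asr)]:
    hypothesis = asr(audio)["text"].lower().strip()
    print(f"{name}: WER={wer(REFERENCE, hypothesis):.2f} -> {hypothesis!r}")
```

Word error rate is a natural headline metric here; the phoneme-level pronunciation feedback mentioned in the abstract would additionally require aligning the CTC or decoder output to a phonemic reference, which is beyond this sketch.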
Related papers
- FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs [63.8261207950923]
FunAudioLLM is a model family designed to enhance natural voice interactions between humans and large language models (LLMs).
At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity.
The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub.
arXiv Detail & Related papers (2024-07-04T16:49:02Z) - Error-preserving Automatic Speech Recognition of Young English Learners' Language [6.491559928368298]
One of the central skills that language learners need to practice is speaking the language.
Recent advances in speech technology and natural language processing allow for the creation of novel tools with which learners can practice their speaking skills.
We build an ASR system that works on spontaneous speech by young language learners and preserves their errors.
arXiv Detail & Related papers (2024-06-05T13:15:37Z) - Child Speech Recognition in Human-Robot Interaction: Problem Solved? [0.024739484546803334]
Recent evolutions in data-driven speech recognition might mean a breakthrough for child speech recognition and social robot applications aimed at children.
We revisit a study on child speech recognition from 2017 and show that performance has indeed increased.
While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences.
arXiv Detail & Related papers (2024-04-26T13:14:28Z) - Adaptation of Whisper models to child speech recognition [3.2548794659022398]
We show that finetuning Whisper on child speech yields significant improvements in ASR performance.
Utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms finetuned Whisper models (a minimal finetuning sketch appears after this list).
arXiv Detail & Related papers (2023-07-24T12:54:45Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - Hindi as a Second Language: Improving Visually Grounded Speech with
Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z) - Computational Language Acquisition with Theory of Mind [84.2267302901888]
We build language-learning agents equipped with Theory of Mind (ToM) and measure its effects on the learning process.
We find that training speakers with a highly weighted ToM listener component leads to performance gains in our image referential game setting.
arXiv Detail & Related papers (2023-03-02T18:59:46Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems, obtaining promising training results with only a single real target-language speaker.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - Automatic Speech recognition for Speech Assessment of Preschool Children [4.554894288663752]
The acoustic and linguistic features of preschool speech are investigated in this study.
Wav2Vec 2.0 is a paradigm that could be used to build a robust end-to-end speech recognition system.
arXiv Detail & Related papers (2022-03-24T07:15:24Z) - Continual-wav2vec2: an Application of Continual Learning for
Self-Supervised Automatic Speech Recognition [0.23872611575805824]
We present a method for continual learning of speech representations for multiple languages using self-supervised learning (SSL).
Wav2vec models perform SSL on raw audio in a pretraining phase and then finetune on a small fraction of annotated data.
We use ideas from continual learning to transfer knowledge from a previous task to speed up pretraining a new language task.
arXiv Detail & Related papers (2021-07-26T10:39:03Z) - Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking
Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method does not need hand-crafted features and is more robust to noise than recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
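Several of the entries above, the Whisper-adaptation and Wav2Vec2 papers in particular, revolve around finetuning a pretrained ASR model on child speech. The sketch below shows one plausible recipe using the HuggingFace Seq2SeqTrainer; the dataset child_speech, the checkpoint, and all hyperparameters are assumptions for illustration, not the cited papers' actual configurations.

```python
# Hypothetical Whisper finetuning recipe on a child-speech dataset.
# `child_speech` is an assumed HuggingFace Dataset with "audio" and
# "text" columns; checkpoint and hyperparameters are illustrative only.
from dataclasses import dataclass

from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def preprocess(example):
    audio = example["audio"]
    # Encoder input: log-Mel features, padded/truncated to 30 s.
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Decoder targets: token ids of the reference transcript.
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

train_set = child_speech.map(preprocess)  # assumed dataset object

@dataclass
class SpeechCollator:
    processor: WhisperProcessor
    def __call__(self, features):
        # Pad audio features and label ids separately.
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt")
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features],
            return_tensors="pt")
        # Mask label padding with -100 so the loss ignores it.
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100)
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-child",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=2000,
)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_set,
                         data_collator=SpeechCollator(processor))
trainer.train()
```

An evaluation loop computing WER on a held-out child test set (as in the sketch after the main abstract) would complete the recipe; per the Whisper-adaptation entry above, the same data could also be used to finetune a Wav2Vec2 CTC model for comparison.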
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.