Child Speech Recognition in Human-Robot Interaction: Problem Solved?
- URL: http://arxiv.org/abs/2404.17394v1
- Date: Fri, 26 Apr 2024 13:14:28 GMT
- Title: Child Speech Recognition in Human-Robot Interaction: Problem Solved?
- Authors: Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme
- Abstract summary: Recent evolutions in data-driven speech recognition might mean a breakthrough for child speech recognition and social robot applications aimed at children.
We revisit a 2017 study on child speech recognition and show that performance has indeed improved.
While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences.
- Score: 0.024739484546803334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long stood in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.
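The sub-second local transcription claim is easy to approximate in practice. Below is a minimal sketch (not the authors' evaluation code) using the open-source `openai-whisper` package; the model size and audio file name are placeholder assumptions.

```python
import time

import whisper  # pip install openai-whisper

# Load a Whisper model onto a local GPU (model size is an assumption;
# the paper benchmarks several Whisper variants against cloud services).
model = whisper.load_model("small", device="cuda")

start = time.perf_counter()
# "child_utterance.wav" is a hypothetical recording of one child utterance.
result = model.transcribe("child_utterance.wav", language="en", fp16=True)
elapsed = time.perf_counter() - start

print(f"Transcript: {result['text'].strip()!r}")
print(f"Transcription time: {elapsed:.2f} s")
```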
Related papers
- FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data [22.933382649048113]
We propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data.
We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6× over human annotations.
arXiv Detail & Related papers (2024-06-25T20:37:16Z)
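FASA itself is a dedicated tool; as a rough illustration of what CTC-based forced alignment does, here is a generic sketch using torchaudio's `forced_align` (torchaudio >= 2.1) with a pretrained wav2vec 2.0 model. The audio file and transcript are placeholders.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # ('-', '|', 'E', 'T', ...); '-' is the CTC blank

waveform, sr = torchaudio.load("child_story.wav")  # hypothetical recording
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)               # (1, frames, classes) logits
    log_probs = torch.log_softmax(emissions, dim=-1)

transcript = "ONCE|UPON|A|TIME"                  # '|' marks word boundaries
targets = torch.tensor([[labels.index(c) for c in transcript]])

# Align each transcript token to emission frames; word timestamps follow
# from the frame rate of the acoustic model.
frame_labels, scores = torchaudio.functional.forced_align(
    log_probs, targets, blank=0
)
print(frame_labels[0][:20])  # per-frame token ids, blank where silent
```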
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
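The masked token-infilling objective mentioned above boils down to corrupting discrete speech or text token sequences and training the model to reconstruct them. A toy sketch of the corruption step (schematic, not the authors' code):

```python
import torch

def mask_for_infilling(token_ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """Hide a random subset of tokens; the model must infill the originals."""
    mask = torch.rand(token_ids.shape) < p
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask

ids = torch.randint(0, 100, (1, 12))   # toy sequence of discrete units
corrupted, mask = mask_for_infilling(ids, mask_id=100)
print(ids)
print(corrupted)
```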
- A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition [2.965450563218781]
We show that finetuning Conformer-Transducer models on child speech yields significant improvements in ASR performance.
We also evaluate Whisper and wav2vec2 adaptation on different child speech datasets.
arXiv Detail & Related papers (2023-11-07T19:32:48Z)
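Comparisons like this usually reduce to word error rate (WER); a minimal sketch with the `jiwer` package, using placeholder transcripts:

```python
import jiwer  # pip install jiwer

references = ["the cat sat on the mat"]            # ground-truth transcript
hypotheses = {
    "conformer-transducer": ["the cat sat on the mat"],
    "whisper": ["the cat sat on a mat"],
    "wav2vec2": ["the cats sat on the mat"],
}

for name, hyp in hypotheses.items():
    print(f"{name}: WER = {jiwer.wer(references, hyp):.2%}")
```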
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts the listener's response: a sequence of facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
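The VQ-VAE quantization step referred to here amounts to snapping each continuous gesture feature to its nearest entry in a learned codebook; a minimal, self-contained sketch:

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map each row of z (N, D) to its nearest codebook entry (K, D)."""
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    codes = dists.argmin(dim=1)        # one discrete token per frame
    return codes, codebook[codes]

codebook = torch.randn(256, 64)        # toy stand-in for a learned codebook
z = torch.randn(10, 64)                # encoder output for 10 motion frames
codes, z_q = vector_quantize(z, codebook)
print(codes)                           # the token sequence the model predicts
```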
- Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications [18.849741353784328]
We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
arXiv Detail & Related papers (2023-06-29T06:14:26Z)
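Evaluating a wav2vec 2.0 model on such recordings follows the standard Hugging Face CTC inference recipe; a sketch below, where the Dutch checkpoint name and audio file are assumptions, not necessarily what the paper used:

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "facebook/wav2vec2-large-xlsr-53-dutch"   # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Hypothetical 16 kHz mono recording of a child reading a Dutch prompt.
audio, sr = sf.read("dutch_child_read.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.inference_mode():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])                # greedy CTC transcript
```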
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv Detail & Related papers (2022-11-14T22:03:36Z)
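The general recipe (the paper's specifics may differ) is to start from a self-supervised checkpoint trained on adult speech, freeze the low-level feature extractor, and fine-tune the rest with a CTC head on child corpora:

```python
from transformers import Wav2Vec2ForCTC

# Self-supervised adult-speech checkpoint; vocab_size must match the
# tokenizer built from the child corpora (the value here is illustrative).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=32,
    ctc_loss_reduction="mean",
)

# Keep the convolutional feature extractor fixed; only the Transformer
# layers and the new CTC head adapt to children's speech.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```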
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z)
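Proper source-filter warping manipulates pitch (source) and formants (filter) separately; as a crude stand-in, the sketch below only raises the pitch of an adult recording with librosa to push it toward child-like speech. File names are placeholders.

```python
import librosa
import soundfile as sf

# Hypothetical adult recording, loaded as 16 kHz mono.
y, sr = librosa.load("adult_utterance.wav", sr=16_000)

# Shift pitch up by 4 semitones; the paper's augmentation additionally
# scales formants independently of F0, which this simple shift does not.
y_aug = librosa.effects.pitch_shift(y, sr=sr, n_steps=4.0)

sf.write("adult_utterance_childlike.wav", y_aug, sr)
```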
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end system achieved 12.5%, 27.5%, and 23.8% WER, setting a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than that of machines, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z)
- Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
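A phonetic posteriorgram is just the per-frame posterior distribution over phonetic classes produced by an acoustic model. As a rough proxy (a character-level CTC model rather than a true phone recognizer, and a placeholder audio file), one can do:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sr = torchaudio.load("speech.wav")   # placeholder recording
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)             # (1, frames, classes) logits
ppg = torch.softmax(emissions, dim=-1)         # posteriors per ~20 ms frame
print(ppg.shape)
```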
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.