Child Speech Recognition in Human-Robot Interaction: Problem Solved?
- URL: http://arxiv.org/abs/2404.17394v1
- Date: Fri, 26 Apr 2024 13:14:28 GMT
- Title: Child Speech Recognition in Human-Robot Interaction: Problem Solved?
- Authors: Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme
- Abstract summary: Recent evolutions in data-driven speech recognition might mean a breakthrough for child speech recognition and social robot applications aimed at children.
We revisit a 2017 study on child speech recognition and show that performance has indeed improved.
While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences.
- Score: 0.024739484546803334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long stood in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.
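The sub-second local transcription claim is easy to approximate in practice. Below is a minimal sketch (not the authors' evaluation code) using the open-source `openai-whisper` package; the model size and audio file name are placeholder assumptions.

```python
import time

import whisper  # pip install openai-whisper

# Load a Whisper model onto a local GPU (model size is an assumption;
# the paper benchmarks several Whisper variants against cloud services).
model = whisper.load_model("small", device="cuda")

start = time.perf_counter()
# "child_utterance.wav" is a hypothetical recording of one child utterance.
result = model.transcribe("child_utterance.wav", language="en", fp16=True)
elapsed = time.perf_counter() - start

print(f"Transcript: {result['text'].strip()!r}")
print(f"Transcription time: {elapsed:.2f} s")
```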
Related papers
- FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data [22.933382649048113]
We propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data.
We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6× over human annotations.
arXiv Detail & Related papers (2024-06-25T20:37:16Z)
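FASA itself is a dedicated tool; as a rough illustration of what CTC-based forced alignment does, here is a generic sketch using torchaudio's `forced_align` (torchaudio >= 2.1) with a pretrained wav2vec 2.0 model. The audio file and transcript are placeholders.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()  # ('-', '|', 'E', 'T', ...); '-' is the CTC blank

waveform, sr = torchaudio.load("child_story.wav")  # hypothetical recording
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)               # (1, frames, classes) logits
    log_probs = torch.log_softmax(emissions, dim=-1)

transcript = "ONCE|UPON|A|TIME"                  # '|' marks word boundaries
targets = torch.tensor([[labels.index(c) for c in transcript]])

# Align each transcript token to emission frames; word timestamps follow
# from the frame rate of the acoustic model.
frame_labels, scores = torchaudio.functional.forced_align(
    log_probs, targets, blank=0
)
print(frame_labels[0][:20])  # per-frame token ids, blank where silent
```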
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
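The masked token-infilling objective mentioned above boils down to corrupting discrete speech or text token sequences and training the model to reconstruct them. A toy sketch of the corruption step (schematic, not the authors' code):

```python
import torch

def mask_for_infilling(token_ids: torch.Tensor, mask_id: int, p: float = 0.15):
    """Hide a random subset of tokens; the model must infill the originals."""
    mask = torch.rand(token_ids.shape) < p
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask

ids = torch.randint(0, 100, (1, 12))   # toy sequence of discrete units
corrupted, mask = mask_for_infilling(ids, mask_id=100)
print(ids)
print(corrupted)
```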
- A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition [2.965450563218781]
We show that finetuning Conformer-Transducer models on child speech yields significant improvements in ASR performance.
We also evaluate Whisper and wav2vec2 adaptation on different child speech datasets.
arXiv Detail & Related papers (2023-11-07T19:32:48Z)
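Comparisons like this usually reduce to word error rate (WER); a minimal sketch with the `jiwer` package, using placeholder transcripts:

```python
import jiwer  # pip install jiwer

references = ["the cat sat on the mat"]            # ground-truth transcript
hypotheses = {
    "conformer-transducer": ["the cat sat on the mat"],
    "whisper": ["the cat sat on a mat"],
    "wav2vec2": ["the cats sat on the mat"],
}

for name, hyp in hypotheses.items():
    print(f"{name}: WER = {jiwer.wer(references, hyp):.2%}")
```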
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts the listener's response: a sequence of facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
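The VQ-VAE quantization step referred to here amounts to snapping each continuous gesture feature to its nearest entry in a learned codebook; a minimal, self-contained sketch:

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Map each row of z (N, D) to its nearest codebook entry (K, D)."""
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    codes = dists.argmin(dim=1)        # one discrete token per frame
    return codes, codebook[codes]

codebook = torch.randn(256, 64)        # toy stand-in for a learned codebook
z = torch.randn(10, 64)                # encoder output for 10 motion frames
codes, z_q = vector_quantize(z, codebook)
print(codes)                           # the token sequence the model predicts
```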
- Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications [18.849741353784328]
We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
arXiv Detail & Related papers (2023-06-29T06:14:26Z)
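Evaluating a wav2vec 2.0 model on such recordings follows the standard Hugging Face CTC inference recipe; a sketch below, where the Dutch checkpoint name and audio file are assumptions, not necessarily what the paper used:

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "facebook/wav2vec2-large-xlsr-53-dutch"   # assumed checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Hypothetical 16 kHz mono recording of a child reading a Dutch prompt.
audio, sr = sf.read("dutch_child_read.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

with torch.inference_mode():
    logits = model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])                # greedy CTC transcript
```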
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv Detail & Related papers (2022-11-14T22:03:36Z)
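The general recipe (the paper's specifics may differ) is to start from a self-supervised checkpoint trained on adult speech, freeze the low-level feature extractor, and fine-tune the rest with a CTC head on child corpora:

```python
from transformers import Wav2Vec2ForCTC

# Self-supervised adult-speech checkpoint; vocab_size must match the
# tokenizer built from the child corpora (the value here is illustrative).
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    vocab_size=32,
    ctc_loss_reduction="mean",
)

# Keep the convolutional feature extractor fixed; only the Transformer
# layers and the new CTC head adapt to children's speech.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")
```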
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z)
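Proper source-filter warping manipulates pitch (source) and formants (filter) separately; as a crude stand-in, the sketch below only raises the pitch of an adult recording with librosa to push it toward child-like speech. File names are placeholders.

```python
import librosa
import soundfile as sf

# Hypothetical adult recording, loaded as 16 kHz mono.
y, sr = librosa.load("adult_utterance.wav", sr=16_000)

# Shift pitch up by 4 semitones; the paper's augmentation additionally
# scales formants independently of F0, which this simple shift does not.
y_aug = librosa.effects.pitch_shift(y, sr=sr, n_steps=4.0)

sf.write("adult_utterance_childlike.wav", y_aug, sr)
```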
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end system achieved 12.5%, 27.5%, and 23.8% WER, setting a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than that of machines, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
arXiv Detail & Related papers (2020-11-12T18:02:15Z)
- Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
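A phonetic posteriorgram is just the per-frame posterior distribution over phonetic classes produced by an acoustic model. As a rough proxy (a character-level CTC model rather than a true phone recognizer, and a placeholder audio file), one can do:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sr = torchaudio.load("speech.wav")   # placeholder recording
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)             # (1, frames, classes) logits
ppg = torch.softmax(emissions, dim=-1)         # posteriors per ~20 ms frame
print(ppg.shape)
```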
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.