Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning
- URL: http://arxiv.org/abs/2503.04710v1
- Date: Thu, 06 Mar 2025 18:57:16 GMT
- Title: Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning
- Authors: Lucas Block Medin, Thomas Pellegrini, Lucile Gelin
- Abstract summary: We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech. We then adapt the best of them, WavLM base+, by unfreezing its transformer blocks during fine-tuning on child speech. We show that WavLM base+ is more robust to various reading tasks and noise levels.
- Score: 9.670752318129326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Child speech recognition is still an underdeveloped area of research due to the lack of data (especially for non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels.
- Index Terms: speech recognition, child speech, self-supervised learning
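For illustration, here is a minimal sketch (not the authors' exact recipe) of CTC phoneme-recognition fine-tuning on WavLM base+ with the transformer blocks left trainable, using the HuggingFace transformers API; the phoneme vocabulary size and all hyperparameters are placeholder assumptions.

```python
# Hedged sketch: fine-tune WavLM base+ for CTC phoneme recognition with the
# transformer blocks unfrozen. Vocabulary size and hyperparameters are
# illustrative placeholders, not the paper's settings.
import torch
from transformers import WavLMForCTC

NUM_PHONEMES = 40  # placeholder: French phoneme inventory + CTC blank

model = WavLMForCTC.from_pretrained(
    "microsoft/wavlm-base-plus",
    vocab_size=NUM_PHONEMES,
    ctc_loss_reduction="mean",
)

# Freeze only the convolutional feature encoder; leaving the transformer
# blocks trainable is what "unfreezing" refers to in the abstract.
model.freeze_feature_encoder()

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# One illustrative training step on dummy data (16 kHz waveforms).
waveforms = torch.randn(2, 16000)                  # batch of 1 s utterances
labels = torch.randint(1, NUM_PHONEMES, (2, 12))   # phoneme index targets
loss = model(input_values=waveforms, labels=labels).loss
loss.backward()
optimizer.step()
```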
Related papers
- Child Speech Recognition in Human-Robot Interaction: Problem Solved? [0.024739484546803334]
We revisit a 2017 study on child speech recognition and show that performance has indeed increased.
Newcomer OpenAI Whisper does markedly better than leading commercial cloud services.
While transcription is not yet perfect, the best model recognises 60.3% of sentences correctly, barring small grammatical differences.
arXiv Detail & Related papers (2024-04-26T13:14:28Z)
- A comparative analysis between Conformer-Transducer, Whisper, and wav2vec2 for improving the child speech recognition [2.965450563218781]
We show that finetuning Conformer-Transducer models on child speech yields significant improvements in child-speech ASR performance.
We also evaluate Whisper and wav2vec2 adaptation on different child speech datasets.
arXiv Detail & Related papers (2023-11-07T19:32:48Z)
- Adaptation of Whisper models to child speech recognition [3.2548794659022398]
We show that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech.
However, utilizing self-supervised Wav2vec2 models finetuned on child speech outperforms Whisper finetuning.
arXiv Detail & Related papers (2023-07-24T12:54:45Z)
- Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications [18.849741353784328]
We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
arXiv Detail & Related papers (2023-06-29T06:14:26Z)
- Textually Pretrained Speech Language Models [107.10344535390956]
We propose TWIST, a method for training SpeechLMs using a warm start from pretrained textual language models.
We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board.
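A hedged sketch of the warm-start idea, assuming a GPT-2 backbone and a placeholder speech-unit vocabulary (not TWIST's exact recipe):

```python
# Hedged sketch of a textual warm start for a speech-unit LM: swap the text
# vocabulary for discrete speech units, keep the transformer weights.
from transformers import AutoModelForCausalLM

NUM_SPEECH_UNITS = 500  # placeholder: size of the discrete unit vocabulary

model = AutoModelForCausalLM.from_pretrained("gpt2")  # textual warm start
# Shrink the (tied) input/output embedding table to the speech-unit
# vocabulary and re-initialize it for the new units; all transformer
# blocks keep their text-pretrained weights.
model.resize_token_embeddings(NUM_SPEECH_UNITS)
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)
# Training then proceeds with the usual next-token objective on sequences
# of discrete speech units.
```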
arXiv Detail & Related papers (2023-05-22T13:12:16Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
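A minimal sketch of this intermediate-layer loss, assuming HuBERT-style discrete pseudo-label targets; the layer indices, dimensions, and per-layer prediction heads are illustrative assumptions, not the official ILS-SSL design:

```python
# Illustrative sketch of intermediate-layer supervision: the same
# masked-prediction loss is also applied to hidden states from selected
# intermediate transformer layers.
import torch
import torch.nn as nn

class IntermediateLayerLoss(nn.Module):
    def __init__(self, hidden_dim: int, num_targets: int, layers=(4, 8, 12)):
        super().__init__()
        self.layers = layers
        # One prediction head per supervised layer (placeholder design).
        self.heads = nn.ModuleDict(
            {str(l): nn.Linear(hidden_dim, num_targets) for l in layers}
        )
        self.ce = nn.CrossEntropyLoss()

    def forward(self, hidden_states, targets, mask):
        # hidden_states: list of (B, T, D) tensors, one per transformer layer
        # targets: (B, T) discrete pseudo-labels; mask: (B, T) bool masked frames
        loss = 0.0
        for l in self.layers:
            logits = self.heads[str(l)](hidden_states[l - 1])  # 1-indexed
            loss = loss + self.ce(logits[mask], targets[mask])
        return loss / len(self.layers)

# Example: 12 transformer layers, 768-dim states, 500 pseudo-label classes.
states = [torch.randn(2, 50, 768) for _ in range(12)]
targets = torch.randint(0, 500, (2, 50))
mask = torch.rand(2, 50) < 0.5  # frames selected for masking
loss = IntermediateLayerLoss(768, 500)(states, targets, mask)
```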
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset [0.0]
This paper introduces a deep-learning-based emotion recognition model for Arabic speech dialogues.
The model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT.
Our experimental results surpass previously reported outcomes.
arXiv Detail & Related papers (2021-10-09T00:58:12Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer.
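The underlying ALBERT-style mechanism, sketched here with assumed dimensions, is cross-layer parameter sharing: one transformer layer's weights are reused at every depth.

```python
# Hedged sketch of ALBERT-style cross-layer parameter sharing: a single
# transformer encoder layer is applied repeatedly, giving the depth of
# N layers with the parameters of one. Dimensions are placeholders.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, num_passes=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.num_passes = num_passes

    def forward(self, x):
        # Reuse the same weights at every depth.
        for _ in range(self.num_passes):
            x = self.layer(x)
        return x

features = torch.randn(2, 100, 768)  # (batch, frames, dim) dummy input
encoded = SharedLayerEncoder()(features)
```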
arXiv Detail & Related papers (2020-05-18T10:42:44Z)