KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
- URL: http://arxiv.org/abs/2512.05994v1
- Date: Mon, 01 Dec 2025 00:19:37 GMT
- Title: KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
- Authors: Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen
- Abstract summary: KidSpeak is a speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. We propose the Flexible and Automatic Speech Aligner (FASA) and use it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6x compared to human annotations.
- Score: 29.54910094759367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancement of conversational and diffusion-based AI, there is a growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and use it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6x compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.
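The forced-alignment step that FASA automates can be illustrated with a toy dynamic-programming aligner: given per-frame scores over symbols and a target transcript, find the best monotonic frame-to-symbol alignment. This is only a minimal sketch of the general idea, not the paper's implementation; the function name, the dict-based frame posteriors, and the stay-or-advance transition model are illustrative assumptions.

```python
def force_align(log_probs, transcript):
    """Toy forced aligner (illustrative, not FASA itself).

    log_probs: one dict per frame mapping symbol -> log-probability.
    transcript: list of symbols; assumes len(log_probs) >= len(transcript).
    Returns the transcript index emitted at each frame.
    """
    T, N = len(log_probs), len(transcript)
    NEG = float("-inf")
    # score[t][j]: best score when frame t emits transcript symbol j
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_probs[0].get(transcript[0], NEG)
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]                      # repeat symbol j
            move = score[t - 1][j - 1] if j > 0 else NEG  # advance to j
            best_prev = max(stay, move)
            if best_prev == NEG:
                continue  # cell unreachable
            back[t][j] = j if stay >= move else j - 1
            score[t][j] = best_prev + log_probs[t].get(transcript[j], NEG)
    # Backtrace from the final cell: all transcript symbols must be consumed.
    path = [0] * T
    j = N - 1
    for t in range(T - 1, -1, -1):
        path[t] = j
        j = back[t][j]
    return path
```

A real aligner such as FASA additionally has to handle noisy audio, imperfect transcripts, and blank/silence symbols; this sketch only shows the core trellis search.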
Related papers
- Can large audio language models understand child stuttering speech? speech summarization, and source separation [3.2684800403907506]
Child speech differs from adult speech in acoustics, prosody, and language development, and includes disfluencies (repetitions, prolongations, blocks). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child).
arXiv Detail & Related papers (2025-10-21T18:53:34Z)
- Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet [72.53502346791814]
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. Age-related ASR and speaker verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z)
- Speech Unlearning [14.755831733659699]
We introduce machine unlearning for speech tasks, a novel and underexplored research problem. It aims to efficiently and effectively remove the influence of specific data from trained speech models without full retraining. It has important applications in privacy preservation, removal of outdated or noisy data, and bias mitigation.
arXiv Detail & Related papers (2025-06-01T06:04:16Z)
- An End-to-End Approach for Child Reading Assessment in the Xhosa Language [0.3579433677269426]
This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data.
arXiv Detail & Related papers (2025-05-23T00:59:58Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks [94.10497337235083]
We are the first to explore the potential of prompting speech LMs in the domain of speech processing.
We reformulate speech processing tasks into speech-to-unit generation tasks.
We show that the prompting method can achieve competitive performance compared to the strong fine-tuning method.
arXiv Detail & Related papers (2024-08-23T13:00:10Z)
- FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data [22.933382649048113]
We propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data.
We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6× over human annotations.
arXiv Detail & Related papers (2024-06-25T20:37:16Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
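The warping idea above operates on the source-filter decomposition of speech; as a rough, self-contained stand-in, plain resampling shifts all frequencies by a constant factor, crudely mimicking the shorter vocal tract of a child. The function below is a hypothetical sketch of that simpler warp, not the paper's source-filter method.

```python
def resample_warp(signal, alpha):
    """Resample `signal` by factor alpha using linear interpolation.

    Played back at the original rate, the output's frequencies are scaled
    by alpha (alpha > 1 raises them). Illustrative augmentation sketch only;
    real source-filter warping warps the spectral envelope instead.
    """
    n_out = int(len(signal) / alpha)
    out = []
    for i in range(n_out):
        pos = i * alpha                    # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out
```

For example, `resample_warp(x, 1.25)` shortens the waveform by 20% and raises its frequencies accordingly, which is one crude way to make adult recordings sound more child-like for augmentation.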
arXiv Detail & Related papers (2022-06-19T12:57:47Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.