TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
- URL: http://arxiv.org/abs/2512.18263v1
- Date: Sat, 20 Dec 2025 08:03:07 GMT
- Title: TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
- Authors: Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson
- Abstract summary: Speech foundation models can address these challenges through Speech In-Context Learning (SICL). We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
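The abstract describes a two-stage selection pipeline: TICL retrieves candidate in-context examples by text-embedding nearest neighbors, and TICL+ adds an acoustic reranking step over those candidates. A minimal sketch of that pipeline, assuming precomputed text and acoustic embeddings (the function and parameter names here are illustrative, not from the paper; the paper's actual embedding models and scoring details are not specified in this abstract):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def ticl_plus_select(query_text_emb, query_acoustic_emb,
                     pool_text_embs, pool_acoustic_embs,
                     n_candidates=10, k=4):
    """Select k in-context examples: KNN by text-embedding similarity
    (the TICL stage), then rerank candidates by acoustic similarity
    (the TICL+ extension)."""
    # Stage 1 (TICL): rank the example pool by text-embedding similarity.
    text_scores = [cosine(query_text_emb, e) for e in pool_text_embs]
    candidates = np.argsort(text_scores)[::-1][:n_candidates]
    # Stage 2 (TICL+): rerank the shortlist by acoustic similarity,
    # so the chosen examples are semantically AND acoustically aligned.
    acoustic_scores = {i: cosine(query_acoustic_emb, pool_acoustic_embs[i])
                       for i in candidates}
    reranked = sorted(candidates, key=lambda i: acoustic_scores[i],
                      reverse=True)
    return [int(i) for i in reranked[:k]]
```

The selected indices would then identify the (audio, transcript) pairs prepended as in-context examples before decoding the test utterance.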
Related papers
- Can large audio language models understand child stuttering speech? speech summarization, and source separation [3.2684800403907506]
Child speech differs from adult speech in acoustics, prosody, and language development, and includes disfluencies (repetitions, prolongations, blocks). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child).
arXiv Detail & Related papers (2025-10-21T18:53:34Z) - TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models [27.013776992438086]
We propose Text-Embedding KNN for SICL (TICL) to enhance off-the-shelf large multimodal models' speech recognition ability without fine-tuning. Our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction.
arXiv Detail & Related papers (2025-09-16T17:07:23Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet [72.53502346791814]
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. Age-related ASR and speaker verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z) - Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction [50.630431647192054]
This paper investigates a novel approach for Target Speech Extraction (TSE) that relies solely on textual context to extract the target speech. We present three CSE models and analyze their performances on three datasets.
arXiv Detail & Related papers (2025-03-11T18:26:10Z) - Can Whisper perform speech-based in-context learning? [15.931776592470895]
This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI.
A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which can reduce the word error rates (WERs).
Language-level adaptation experiments using Chinese dialects showed that when applying SICL to isolated word ASR, consistent and considerable relative WER reductions can be achieved.
arXiv Detail & Related papers (2023-09-13T16:46:27Z) - Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
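The CTAP summary above describes two encoders pulling phoneme and speech embeddings into a joint multimodal space, which is typically done with a symmetric contrastive (InfoNCE-style) objective over paired examples. A toy sketch of such a loss, assuming batches of already-encoded phoneme/speech pairs (this is a generic contrastive formulation for illustration, not CTAP's exact objective):

```python
import numpy as np

def info_nce_loss(phoneme_embs, speech_embs, temperature=0.07):
    """Symmetric contrastive loss over paired phoneme/speech embeddings:
    matched pairs (the diagonal) should score higher than all mismatches."""
    # Normalize both modalities to unit length.
    p = phoneme_embs / np.linalg.norm(phoneme_embs, axis=1, keepdims=True)
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    logits = p @ s.T / temperature  # pairwise cosine similarities, scaled
    # Cross-entropy with the matched pair on the diagonal, both directions.
    log_sm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_p2s = -np.diag(log_sm).mean()
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_s2p = -np.diag(log_sm_t).mean()
    return (loss_p2s + loss_s2p) / 2
```

Minimizing this loss drives each phoneme embedding toward its paired speech embedding and away from the other utterances in the batch, yielding the shared space the summary refers to.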
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
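The vector-quantization step mentioned in this summary maps each continuous content frame to its nearest entry in a learned codebook, discarding fine-grained speaker detail. A minimal nearest-neighbor quantizer illustrating that operation (function names and the L2 distance choice are illustrative; VQMIVC's full training objective, including the mutual-information term, is not reproduced here):

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each content frame to its nearest codebook entry (L2 distance),
    as in VQ-based content encoding: returns the quantized frames and the
    discrete code indices."""
    # Squared L2 distance between every frame and every codebook vector.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)       # nearest codebook index per frame
    return codebook[idx], idx    # quantized frames, discrete codes
```

Because the output is restricted to codebook entries, the discrete codes act as a speaker-independent content representation that a decoder can recombine with separately encoded speaker and pitch information.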
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Learning to Understand Child-directed and Adult-directed Speech [18.29692441616062]
Human language acquisition research indicates that child-directed speech helps language learners.
We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS).
We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better.
arXiv Detail & Related papers (2020-05-06T10:47:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.