Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models
- URL: http://arxiv.org/abs/2508.21248v1
- Date: Thu, 28 Aug 2025 22:32:42 GMT
- Title: Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models
- Authors: Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
- Abstract summary: This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based KWS system. Our approach achieved state-of-the-art results across all keyword sets for children's speech.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT, and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and a probability of false alarm and probability of miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the Wav2Vec2 model. The results demonstrated a significant improvement over the traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated on an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.
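The ATWV and MTWV figures above follow the NIST Spoken Term Detection family of metrics, which combine per-keyword miss and false-alarm probabilities into a single term-weighted value. As an illustrative sketch (not the paper's evaluation code), the formulas can be written out directly; the per-keyword probabilities below are hypothetical inputs, and note that the aggregate P_miss/P_FA quoted in the abstract cannot simply be plugged into this formula to reproduce the reported ATWV:

```python
# Sketch of the Term-Weighted Value (TWV) metric family, per the NIST
# STD/KWS definition: TWV = 1 - P_miss - beta * P_fa, where beta is
# derived from the miss/false-alarm costs and the keyword prior.

def beta(cost_fa: float = 0.1, cost_miss: float = 1.0,
         p_target: float = 1e-4) -> float:
    """NIST weighting factor; the default costs and prior give beta ~= 999.9."""
    return (cost_fa / cost_miss) * (1.0 / p_target - 1.0)

def twv(p_miss: float, p_fa: float, b: float) -> float:
    """Term-weighted value for one keyword at one operating point."""
    return 1.0 - p_miss - b * p_fa

def atwv(per_keyword: list[tuple[float, float]], b: float) -> float:
    """Actual TWV: average of per-keyword TWVs at the system's hard decisions.
    MTWV is the same average maximized over the detection threshold."""
    return sum(twv(pm, pf, b) for pm, pf in per_keyword) / len(per_keyword)

b = beta()
print(round(b, 1))                  # 999.9
print(round(twv(0.05, 0.0, b), 3))  # a keyword with 5% misses, no false alarms
```

Because beta is so large, even tiny per-trial false-alarm rates are punished heavily, which is why KWS systems tune their decision threshold per keyword.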
Related papers
- Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?
This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models in improving the performance of ASR for children's speech in zero-shot scenarios. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario.
arXiv Detail & Related papers (2025-08-28T21:32:36Z)
- SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive S2S LLM evaluation.
arXiv Detail & Related papers (2025-08-28T15:47:37Z)
- Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. Age-related ASR and speaker verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z)
- Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
This paper presents a detailed analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers capture speaker-specific cues more effectively than deeper layers. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components.
arXiv Detail & Related papers (2025-08-14T04:11:44Z)
- Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining
Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target speaker in an audio frame. Deep neural network-based models have shown good performance in this task. We propose a causal, Self-Supervised Learning (SSL) pretraining framework to enhance TS-VAD performance in noisy conditions.
arXiv Detail & Related papers (2025-01-06T18:00:14Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies.
It includes documented, large-scale, heterogeneous corpora with up to 14,000 hours of speech.
It includes ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community.
arXiv Detail & Related papers (2023-09-11T14:13:09Z)
- Nonwords Pronunciation Classification in Language Development Tests for Preschool Children
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z)
- Learning Decoupling Features Through Orthogonality Regularization
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications.
We develop a two-branch deep network (KWS branch and SV branch) with the same network structure.
A novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously.
arXiv Detail & Related papers (2022-03-31T03:18:13Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)