Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
- URL: http://arxiv.org/abs/2508.10332v1
- Date: Thu, 14 Aug 2025 04:11:44 GMT
- Title: Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
- Authors: Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
- Abstract summary: This paper presents a detailed analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers capture speaker-specific cues more effectively than deeper layers. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components.
- Score: 27.60599947546406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
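The layer-wise probing pipeline the abstract describes (per-layer SSL features, optional PCA, then a trait classifier) can be sketched as follows. This is a minimal illustration on synthetic data: the 13-layer count (CNN output plus 12 transformer layers in Wav2Vec2-base), the nearest-centroid probe, and the layer-dependent signal strength are all assumptions made for the sketch, not the paper's actual data, model, or classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(X, n_components):
    """Project centered features onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def nearest_centroid_accuracy(X, y):
    """Toy probe: classify each sample by its nearest class centroid."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X[:, None, :] - centroids[None], axis=-1)
    return float((classes[dists.argmin(axis=1)] == y).mean())

# Simulate utterance-level embeddings from each of 13 layers (768-dim each),
# with a binary speaker-trait label (e.g. gender).
n_utts, dim, n_layers = 200, 768, 13
y = rng.integers(0, 2, size=n_utts)

for layer in range(n_layers):
    # Early layers get a stronger trait signal, mimicking the finding that
    # layers 1-7 encode speaker cues more strongly than deeper layers.
    signal = max(0.0, 1.0 - layer / 8.0)
    X = rng.normal(size=(n_utts, dim)) + signal * y[:, None]
    acc_raw = nearest_centroid_accuracy(X, y)
    acc_pca = nearest_centroid_accuracy(pca_reduce(X, 32), y)
    print(f"layer {layer:2d}: raw={acc_raw:.2f} pca={acc_pca:.2f}")
```

Run on real features, `X` would instead hold pooled hidden states from one model layer per utterance; the per-layer loop is what produces the depth-wise accuracy profile the paper analyzes.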
Related papers
- KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? [79.27736230305516]
We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
arXiv Detail & Related papers (2026-01-13T07:32:50Z) - KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening [29.54910094759367]
KidSpeak is a speech-enhanced foundation model capable of both generative and discriminative tasks, tailored specifically to children's speech patterns. We propose the Flexible and Automatic Speech Aligner (FASA) and use it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6x compared to human annotations.
arXiv Detail & Related papers (2025-12-01T00:19:37Z) - Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models [12.976537689930373]
This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based KWS system. Our approach achieved state-of-the-art results across all keyword sets for children's speech.
arXiv Detail & Related papers (2025-08-28T22:32:42Z) - Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech? [43.31597557333867]
This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models in improving ASR performance for children's speech in zero-shot scenarios. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario.
arXiv Detail & Related papers (2025-08-28T21:32:36Z) - Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet [72.53502346791814]
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech; flat-start training on child speech mitigates these biases. Age-related ASR and speaker-verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z) - Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech [24.034728707160497]
This paper introduces WSW2.0, an automated framework for analyzing vocal interactions in preschool classrooms. WSW2.0 achieves a weighted F1 score of 0.845, an accuracy of 0.846, and an error-corrected kappa of 0.672 for speaker classification (child vs. teacher). We apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings.
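The "error-corrected kappa" reported above is presumably a chance-corrected agreement statistic in the style of Cohen's kappa. A minimal sketch of that computation follows; the implementation and the toy labels are illustrative assumptions, not WSW2.0's code.

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = float((y_true == y_pred).mean())
    p_e = sum(float((y_true == l).mean()) * float((y_pred == l).mean())
              for l in labels)
    return (p_o - p_e) / (1.0 - p_e)

# Toy child-vs-teacher speaker labels (0 = child, 1 = teacher).
truth = np.array([0, 0, 1, 1, 0, 1, 0, 1])
preds = np.array([0, 0, 1, 1, 0, 1, 1, 1])
print(round(cohens_kappa(truth, preds), 3))  # → 0.75
```

Unlike raw accuracy, kappa discounts agreement expected by chance, which matters when one speaker class (e.g. child speech) dominates the recordings.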
arXiv Detail & Related papers (2025-05-15T05:21:34Z) - Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z) - Improving child speech recognition with augmented child-like speech [20.709414063132627]
Cross-lingual child-to-child voice conversion significantly improved child ASR performance.
State-of-the-art ASRs show suboptimal performance for child speech.
arXiv Detail & Related papers (2024-06-12T08:56:46Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z) - Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.