Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
- URL: http://arxiv.org/abs/2508.10332v1
- Date: Thu, 14 Aug 2025 04:11:44 GMT
- Title: Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
- Authors: Abhijit Sinha, Harishankar Kumar, Mohit Joshi, Hemant Kumar Kathania, Shrikanth Narayanan, Sudarsana Reddy Kadiri
- Abstract summary: This paper presents a detailed analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers capture speaker-specific cues more effectively than deeper layers. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components.
- Score: 27.60599947546406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Children's speech presents challenges for age and gender classification due to high variability in pitch, articulation, and developmental traits. While self-supervised learning (SSL) models perform well on adult speech tasks, their ability to encode speaker traits in children remains underexplored. This paper presents a detailed layer-wise analysis of four Wav2Vec2 variants using the PFSTAR and CMU Kids datasets. Results show that early layers (1-7) capture speaker-specific cues more effectively than deeper layers, which increasingly focus on linguistic information. Applying PCA further improves classification, reducing redundancy and highlighting the most informative components. The Wav2Vec2-large-lv60 model achieves 97.14% (age) and 98.20% (gender) on CMU Kids; base-100h and large-lv60 models reach 86.05% and 95.00% on PFSTAR. These results reveal how speaker traits are structured across SSL model depth and support more targeted, adaptive strategies for child-aware speech interfaces.
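The layer-wise probing pipeline the abstract describes (per-layer SSL features, optional PCA, then a trait classifier) can be sketched as follows. This is a minimal illustration on synthetic data: the 13-layer count (CNN output plus 12 transformer layers in Wav2Vec2-base), the nearest-centroid probe, and the layer-dependent signal strength are all assumptions made for the sketch, not the paper's actual data, model, or classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(X, n_components):
    """Project centered features onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def nearest_centroid_accuracy(X, y):
    """Toy probe: classify each sample by its nearest class centroid."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X[:, None, :] - centroids[None], axis=-1)
    return float((classes[dists.argmin(axis=1)] == y).mean())

# Simulate utterance-level embeddings from each of 13 layers (768-dim each),
# with a binary speaker-trait label (e.g. gender).
n_utts, dim, n_layers = 200, 768, 13
y = rng.integers(0, 2, size=n_utts)

for layer in range(n_layers):
    # Early layers get a stronger trait signal, mimicking the finding that
    # layers 1-7 encode speaker cues more strongly than deeper layers.
    signal = max(0.0, 1.0 - layer / 8.0)
    X = rng.normal(size=(n_utts, dim)) + signal * y[:, None]
    acc_raw = nearest_centroid_accuracy(X, y)
    acc_pca = nearest_centroid_accuracy(pca_reduce(X, 32), y)
    print(f"layer {layer:2d}: raw={acc_raw:.2f} pca={acc_pca:.2f}")
```

Run on real features, `X` would instead hold pooled hidden states from one model layer per utterance; the per-layer loop is what produces the depth-wise accuracy profile the paper analyzes.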
Related papers
- KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? [79.27736230305516]
We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
arXiv Detail & Related papers (2026-01-13T07:32:50Z) - KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening [29.54910094759367]
KidSpeak is a speech-enhanced foundation model capable of both generative and discriminative tasks, tailored specifically to children's speech patterns. We propose the Flexible and Automatic Speech Aligner (FASA) and use it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6x compared to human annotations.
arXiv Detail & Related papers (2025-12-01T00:19:37Z) - Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models [12.976537689930373]
This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based KWS system. Our approach achieved state-of-the-art results across all keyword sets for children's speech.
arXiv Detail & Related papers (2025-08-28T22:32:42Z) - Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech? [43.31597557333867]
This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models in improving ASR performance for children's speech in zero-shot scenarios. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario.
arXiv Detail & Related papers (2025-08-28T21:32:36Z) - Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet [72.53502346791814]
We compare flat-start training across datasets, SSL representations (WavLM, XEUS), and decoder architectures. SSL representations are biased toward adult speech; flat-start training on child speech mitigates these biases. Age-related ASR and speaker-verification analysis highlights the limitations of proprietary models.
arXiv Detail & Related papers (2025-08-22T17:59:35Z) - Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech [24.034728707160497]
This paper introduces WSW2.0, an automated framework for analyzing vocal interactions in preschool classrooms. WSW2.0 achieves a weighted F1 score of 0.845, an accuracy of 0.846, and an error-corrected kappa of 0.672 for speaker classification (child vs. teacher). We apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings.
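The "error-corrected kappa" reported above is presumably a chance-corrected agreement statistic in the style of Cohen's kappa. A minimal sketch of that computation follows; the implementation and the toy labels are illustrative assumptions, not WSW2.0's code.

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = float((y_true == y_pred).mean())
    p_e = sum(float((y_true == l).mean()) * float((y_pred == l).mean())
              for l in labels)
    return (p_o - p_e) / (1.0 - p_e)

# Toy child-vs-teacher speaker labels (0 = child, 1 = teacher).
truth = np.array([0, 0, 1, 1, 0, 1, 0, 1])
preds = np.array([0, 0, 1, 1, 0, 1, 1, 1])
print(round(cohens_kappa(truth, preds), 3))  # → 0.75
```

Unlike raw accuracy, kappa discounts agreement expected by chance, which matters when one speaker class (e.g. child speech) dominates the recordings.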
arXiv Detail & Related papers (2025-05-15T05:21:34Z) - Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
arXiv Detail & Related papers (2024-07-03T08:33:39Z) - Improving child speech recognition with augmented child-like speech [20.709414063132627]
Cross-lingual child-to-child voice conversion significantly improved child ASR performance.
State-of-the-art ASRs show suboptimal performance for child speech.
arXiv Detail & Related papers (2024-06-12T08:56:46Z) - Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performances on 4 out of 5 widely-used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z) - Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv Detail & Related papers (2022-06-19T12:57:47Z) - Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z) - Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z) - On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition [53.17176024917725]
Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods.
This paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods.
arXiv Detail & Related papers (2022-03-28T09:12:24Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.