Who Said What (WSW 2.0)? Enhanced Automated Analysis of Preschool Classroom Speech
- URL: http://arxiv.org/abs/2505.09972v1
- Date: Thu, 15 May 2025 05:21:34 GMT
- Title: Who Said What (WSW 2.0)? Enhanced Automated Analysis of Preschool Classroom Speech
- Authors: Anchen Sun, Tiantian Feng, Gabriela Gutierrez, Juan J Londono, Anfeng Xu, Batya Elbaum, Shrikanth Narayanan, Lynn K Perry, Daniel S Messinger
- Abstract summary: This paper introduces an automated framework, WSW2.0, for analyzing vocal interactions in preschool classrooms. WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher). We apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper introduces WSW2.0, an automated framework for analyzing vocal interactions in preschool classrooms that enhances both accuracy and scalability through the integration of wav2vec2-based speaker classification and Whisper (large-v2 and large-v3) speech transcription. A total of 235 minutes of audio recordings (160 minutes from 12 children and 75 minutes from 5 teachers) were used to compare system outputs to expert human annotations. WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher). Transcription quality is moderate to high, with word error rates of .119 for teachers and .238 for children. WSW2.0 exhibits relatively high absolute-agreement intraclass correlations (ICCs) with expert transcriptions for a range of classroom language features, including teacher and child mean utterance length, lexical diversity, question asking, and responses to questions and other utterances, with ICCs between .64 and .98. To establish scalability, we apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings, demonstrating the framework's robustness for broad real-world applications. These findings highlight the potential of deep learning and natural language processing techniques to revolutionize educational research by providing accurate measures of key features of preschool classroom speech, ultimately guiding more effective intervention strategies and supporting early childhood language development.
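In practice, the pipeline the abstract describes (a wav2vec2 speaker classifier paired with Whisper transcription) can be sketched with Hugging Face transformers. The sketch below is one plausible wiring, not the authors' released code: the classifier checkpoint ID is hypothetical, and segmentation of the audio into utterances is assumed to happen upstream.

```python
# A minimal sketch of a WSW2.0-style pipeline (assumed structure, not the
# authors' implementation). Each utterance is labeled child vs. teacher by a
# fine-tuned wav2vec2 classifier, then transcribed with Whisper large-v3.
import torch
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    pipeline,
)

CLASSIFIER_ID = "your-org/wav2vec2-child-teacher"  # hypothetical checkpoint

extractor = AutoFeatureExtractor.from_pretrained(CLASSIFIER_ID)
classifier = AutoModelForAudioClassification.from_pretrained(CLASSIFIER_ID)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def label_and_transcribe(waveform, sampling_rate=16_000):
    """Classify the speaker of one utterance, then transcribe it.

    `waveform` is a 1-D float numpy array for a single pre-segmented utterance.
    """
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    speaker = classifier.config.id2label[int(logits.argmax(dim=-1))]
    text = asr({"raw": waveform, "sampling_rate": sampling_rate})["text"]
    return speaker, text.strip()
```

The reported agreement figures could be computed with standard tooling, e.g. scikit-learn's `f1_score(..., average="weighted")` and `cohen_kappa_score` for the speaker labels, and a package such as `jiwer` for word error rate, though the paper does not specify which implementations it used.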
Related papers
- Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice [52.747242157396315]
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry. We introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities.
arXiv Detail & Related papers (2025-07-23T14:07:41Z)
- An End-to-End Approach for Child Reading Assessment in the Xhosa Language [0.3579433677269426]
This study focuses on Xhosa, a language spoken in South Africa, to advance child speech recognition capabilities. We present a novel dataset composed of child speech samples in Xhosa. The results indicate that the performance of these models can be significantly influenced by the amount and balancing of the available training data.
arXiv Detail & Related papers (2025-05-23T00:59:58Z)
- Automatic Proficiency Assessment in L2 English Learners [51.652753736780205]
Second language (L2) proficiency in English is usually perceptually evaluated by English teachers or expert evaluators. This paper explores deep learning techniques for comprehensive L2 proficiency assessment, addressing both the speech signal and its corresponding transcription.
arXiv Detail & Related papers (2025-05-05T12:36:03Z)
- Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling [0.0]
This study assesses five cutting-edge ASR systems' recognition of non-native English accented speech using recordings from the L2-ARCTIC corpus. For read speech, Whisper and AssemblyAI achieved the best accuracy, with mean Match Error Rates (MER) of 0.054 and 0.056 respectively. For spontaneous speech, RevAI performed best with a mean MER of 0.063.
arXiv Detail & Related papers (2025-03-10T05:09:44Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs. We share lexical, syntactic, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features: speaker-regularized spectral basis embedding (SBE) features, which exploit a special regularization term to enforce homogeneity of speaker features in adaptation, and feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms [0.4207829324073153]
We propose an automated framework that uses software to classify speakers and to transcribe their utterances. We compare results from our framework to those from a human expert for 110 minutes of classroom recordings. The results suggest substantial progress in analyzing classroom speech that may support children's language development.
arXiv Detail & Related papers (2024-01-14T18:27:37Z)
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading [73.59525356467574]
A speaker's own characteristics can always be portrayed well by a few facial images, or even a single image, with shallow networks, whereas the fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks. Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Understanding Spoken Language Development of Children with ASD Using Pre-trained Speech Embeddings [26.703275678213135]
Natural Language Sample (NLS) analysis has gained attention as a promising complement to traditional methods. This paper proposes applications of speech processing technologies in support of automated assessment of children's spoken language development.
arXiv Detail & Related papers (2023-05-23T14:39:49Z)
- Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate. The task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches motivated by the modeling of specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling. Our technique takes advantage of the fact that oral proficiency tests rate multiple responses for a candidate: we extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)