Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms
- URL: http://arxiv.org/abs/2401.07342v3
- Date: Wed, 10 Apr 2024 21:02:41 GMT
- Title: Who Said What? An Automated Approach to Analyzing Speech in Preschool Classrooms
- Authors: Anchen Sun, Juan J Londono, Batya Elbaum, Luis Estrada, Roberto Jose Lazo, Laura Vitale, Hugo Gonzalez Villasanti, Riccardo Fusaroli, Lynn K Perry, Daniel S Messinger
- Abstract summary: We propose an automated framework that uses open-source software to classify speakers (ALICE) and to transcribe their utterances (Whisper).
We compare results from our framework to those from a human expert for 110 minutes of classroom recordings.
The results suggest substantial progress in analyzing classroom speech that may support children's language development.
- Score: 0.4207829324073153
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Young children spend substantial portions of their waking hours in noisy preschool classrooms. In these environments, children's vocal interactions with teachers are critical contributors to their language outcomes, but manually transcribing these interactions is prohibitive. Using audio from child- and teacher-worn recorders, we propose an automated framework that uses open source software both to classify speakers (ALICE) and to transcribe their utterances (Whisper). We compare results from our framework to those from a human expert for 110 minutes of classroom recordings, including 85 minutes from child-worn microphones (n=4 children) and 25 minutes from teacher-worn microphones (n=2 teachers). The overall proportion of agreement, that is, the proportion of correctly classified teacher and child utterances, was .76, with an error-corrected kappa of .50 and a weighted F1 of .76. The word error rate for both teacher and child transcriptions was .15, meaning that 15% of words would need to be deleted, added, or changed to equate the Whisper and expert transcriptions. Moreover, speech features such as the mean length of utterances in words, the proportion of teacher and child utterances that were questions, and the proportion of utterances that were responded to within 2.5 seconds were similar when calculated separately from expert and automated transcriptions. The results suggest substantial progress in analyzing classroom speech that may support children's language development. Future research using natural language processing is under way to improve speaker classification and to analyze results from the application of the automated framework to a larger dataset containing classroom recordings from 13 children and 3 teachers observed on 17 occasions over one year.
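As a concrete illustration of the classify-then-transcribe pipeline described in the abstract, the sketch below routes utterance-level clips through a speaker-classification step and then through Whisper. The Whisper calls (`whisper.load_model`, `model.transcribe`) follow the open-source openai-whisper package; `classify_speaker` is a hypothetical placeholder for ALICE, which is normally run as its own tool on the recordings. Treat this as an illustrative outline under those assumptions, not the authors' implementation.

```python
# Minimal sketch of the classify-then-transcribe pipeline (not the authors' code).
# Assumes the openai-whisper package; classify_speaker() is a hypothetical
# stand-in for the ALICE speaker classifier, which runs as a separate tool.
import whisper


def classify_speaker(clip_path: str) -> str:
    """Placeholder: ALICE assigns each utterance to 'teacher' or 'child'."""
    raise NotImplementedError("Run ALICE on the clip and return its label.")


def transcribe_clip(model, clip_path: str) -> str:
    """Transcribe one utterance-level clip with Whisper."""
    result = model.transcribe(clip_path, language="en")
    return result["text"].strip()


if __name__ == "__main__":
    model = whisper.load_model("medium")           # any Whisper checkpoint
    for clip in ["utt_0001.wav", "utt_0002.wav"]:  # hypothetical clip files
        print(classify_speaker(clip), transcribe_clip(model, clip), sep=": ")
```

The evaluation statistics and derived speech features named in the abstract can be computed from paired expert and automated outputs with standard libraries. The following sketch uses scikit-learn and jiwer on made-up toy data; the 2.5-second response window comes from the abstract, while the record layout and all values are assumptions for illustration.

```python
# Sketch of the reported metrics and speech features, computed on toy data.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
import jiwer

# Speaker labels per utterance: expert reference vs. automated classification.
expert    = ["teacher", "child", "child", "teacher", "child"]
automated = ["teacher", "child", "teacher", "teacher", "child"]

agreement   = accuracy_score(expert, automated)      # proportion of agreement
kappa       = cohen_kappa_score(expert, automated)   # error-corrected kappa
weighted_f1 = f1_score(expert, automated, average="weighted")

# Word error rate between an expert transcript and a Whisper transcript.
wer = jiwer.wer("can you put the blue block on top",
                "can you put a blue block on top")

# Toy utterance records (speaker, start_s, end_s, text); field layout is assumed.
utts = [
    ("teacher", 0.0, 1.8, "what are you building"),
    ("child",   2.1, 3.0, "a big tower"),
    ("teacher", 6.0, 7.2, "wow that is tall"),
]
mlu_words = sum(len(u[3].split()) for u in utts) / len(utts)  # mean length of utterance
responded = sum(1 for a, b in zip(utts, utts[1:])
                if b[0] != a[0] and b[1] - a[2] <= 2.5)       # reply within 2.5 s
proportion_responded = responded / (len(utts) - 1)

print(agreement, kappa, weighted_f1, wer, mlu_words, proportion_responded)
```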
Related papers
- Who Said What WSW 2.0? Enhanced Automated Analysis of Preschool Classroom Speech [24.034728707160497]
This paper introduces WSW2.0, an automated framework for analyzing vocal interactions in preschool classrooms. WSW2.0 achieves a weighted F1 score of .845, accuracy of .846, and an error-corrected kappa of .672 for speaker classification (child vs. teacher). We apply the framework to an extensive dataset spanning two years and over 1,592 hours of classroom audio recordings.
arXiv Detail & Related papers (2025-05-15T05:21:34Z) - Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana [0.0]
This paper reports on three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana.
We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5.
This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago.
arXiv Detail & Related papers (2023-10-26T17:30:13Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - Building a Non-native Speech Corpus Featuring Chinese-English Bilingual Children: Compilation and Rationale [3.924235219960689]
This paper introduces a non-native speech corpus consisting of narratives from fifty 5- to 6-year-old Chinese-English children.
Transcripts totaling 6.5 hours of children taking a narrative comprehension test in English (L2) are presented, along with human-rated scores and annotations of grammatical and pronunciation errors.
The children also completed the parallel MAIN tests in Chinese (L1) for reference purposes.
arXiv Detail & Related papers (2023-04-30T10:41:43Z) - Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z) - GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio [88.20960848885575]
GigaSpeech is a multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training.
Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles.
For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h.
arXiv Detail & Related papers (2021-06-13T04:09:16Z) - Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Senone-aware Adversarial Multi-task Training for Unsupervised Child to Adult Speech Adaptation [26.065719754453823]
We propose a feature adaptation approach to minimize acoustic mismatch at the senone (tied triphone states) level between adult and child speech.
We validate the proposed method on three tasks: child speech recognition, child pronunciation assessment, and child fluency score prediction.
arXiv Detail & Related papers (2021-02-23T04:49:27Z) - Analysis of Disfluency in Children's Speech [25.68434431663045]
We present a novel dataset with annotated disfluencies of spontaneous explanations from 26 children (ages 5-8).
Children have higher disfluency and filler rates, tend to use nasal filled pauses more frequently, and on average exhibit longer reparandums than repairs.
Despite the differences, an automatic disfluency detection system trained on adult (Switchboard) speech transcripts performs reasonably well on children's speech.
arXiv Detail & Related papers (2020-10-08T22:51:25Z) - Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms [50.19997675066203]
We build an end-to-end neural framework that automatically detects questions from teachers' audio recordings.
By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions.
arXiv Detail & Related papers (2020-05-16T02:17:04Z)
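To make the last entry above concrete, here is a minimal, generic sketch of a multi-task question-detection model: a shared encoder over utterance features with one head that flags questions and one that predicts a question type. All module choices, dimensions, and label sets are illustrative assumptions and do not reproduce the cited paper's architecture.

```python
# Generic multi-task sketch (PyTorch): shared encoder + two classification heads.
import torch
import torch.nn as nn


class MultiTaskQuestionModel(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, n_question_types=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.is_question_head = nn.Linear(hidden, 2)          # question vs. not
        self.type_head = nn.Linear(hidden, n_question_types)  # e.g. open vs. closed

    def forward(self, x):                 # x: (batch, time, feat_dim)
        _, h = self.encoder(x)
        h = h.squeeze(0)                  # final hidden state per sequence
        return self.is_question_head(h), self.type_head(h)


model = MultiTaskQuestionModel()
features = torch.randn(8, 50, 128)        # toy batch of utterance-level features
is_q_logits, type_logits = model(features)

# Joint loss over both tasks strengthens the shared representation.
loss = (nn.functional.cross_entropy(is_q_logits, torch.randint(0, 2, (8,)))
        + nn.functional.cross_entropy(type_logits, torch.randint(0, 4, (8,))))
loss.backward()
```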
This list is automatically generated from the titles and abstracts of the papers on this site.