Non-native Children's Automatic Speech Assessment Challenge (NOCASA)
- URL: http://arxiv.org/abs/2504.20678v1
- Date: Tue, 29 Apr 2025 11:59:08 GMT
- Title: Non-native Children's Automatic Speech Assessment Challenge (NOCASA)
- Authors: Yaroslav Getman, Tamás Grósz, Mikko Kurimo, Giampiero Salvi,
- Abstract summary: "NOCASA" is a data competition held as part of the IEEE MLSP 2025 conference. It challenges participants to develop systems that can assess single-word pronunciations of young second language (L2) learners. We provide pseudo-anonymized training data (TeflonNorL2) containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words.
- Score: 15.921285405887009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the "Non-native Children's Automatic Speech Assessment" (NOCASA) challenge, a data competition held as part of the IEEE MLSP 2025 conference. NOCASA challenges participants to develop new systems that can assess single-word pronunciations of young second language (L2) learners as part of a gamified pronunciation training app. To achieve this, several issues must be addressed, most notably the limited amount of available training data and the highly unbalanced distribution among the pronunciation level categories. To expedite development, we provide pseudo-anonymized training data (TeflonNorL2) containing 10,334 recordings from 44 speakers attempting to pronounce 205 distinct Norwegian words, human-rated on a 1 to 5 scale (the number of stars that should be awarded in the game). In addition to the data, two trained systems are released as official baselines: an SVM classifier trained on the ComParE_16 acoustic feature set and a multi-task wav2vec 2.0 model. The latter achieves the best performance on the challenge test set, with an unweighted average recall (UAR) of 36.37%.
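The official metric, unweighted average recall (UAR), is the macro average of per-class recalls, so each of the five star categories counts equally despite the unbalanced label distribution. A minimal sketch with hypothetical labels:

```python
# Unweighted average recall (UAR): the mean of per-class recalls, so each
# star category counts equally regardless of how many samples it has.
from sklearn.metrics import recall_score

# Hypothetical gold ratings and system predictions (1-5 stars), for illustration.
y_true = [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]
y_pred = [1, 2, 3, 3, 3, 2, 4, 5, 4, 5]

uar = recall_score(y_true, y_pred, average="macro")  # macro recall == UAR
print(f"UAR: {uar:.2%}")
```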
Related papers
- Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities [8.300738063140129]
This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech LLM, on 5,000 Speechocean762 utterances. The model generates scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within a ±2 tolerance.
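A minimal sketch of the ±2-tolerance agreement computation described above; all scores below are hypothetical:

```python
# Fraction of utterances whose predicted score lies within +-2 points of the
# human rating; the arrays are illustrative, not the paper's data.
import numpy as np

human = np.array([8, 9, 6, 7, 10, 5, 8])  # hypothetical human ratings (0-10)
model = np.array([7, 9, 8, 9, 9, 4, 6])   # hypothetical LLM-generated scores

agreement = np.mean(np.abs(model - human) <= 2)
print(f"Within-±2 agreement: {agreement:.2%}")
```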
arXiv Detail & Related papers (2026-01-20T15:48:38Z)
- Towards stable AI systems for Evaluating Arabic Pronunciations [0.7999703756441757]
This study introduces a diverse, diacritised corpus of isolated Arabic letters and demonstrates that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it. We show that this phoneme-level task is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds.
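As a rough illustration of the evaluation setup, a hedged sketch of scoring a wav2vec 2.0 classifier on one isolated-letter clip; the checkpoint and the 28-class head are assumptions, not the paper's configuration:

```python
# Hypothetical sketch: classifying one short clip with a wav2vec 2.0 model.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=28  # one class per Arabic letter (assumed)
)
model.eval()

waveform = torch.randn(8000)  # stand-in for ~0.5 s of audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted letter id:", logits.argmax(dim=-1).item())
```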
arXiv Detail & Related papers (2025-08-27T05:49:15Z)
- SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work [87.9341538630949]
The first Sign Language Production Challenge was held as part of the third SLRTP Workshop at CVPR 2025. The competition aims to evaluate architectures that translate spoken language sentences into sequences of skeleton poses. This paper presents the challenge design and the winning methodologies.
arXiv Detail & Related papers (2025-08-09T11:57:33Z)
- The NTNU System at the S&I Challenge 2025 SLA Open Track [4.3128061558581585]
We propose a system that integrates W2V with the Phi-4 multimodal large language model (MLLM) through a score fusion strategy. The proposed system achieves a root mean square error (RMSE) of 0.375 on the official test set of the Speak & Improve Challenge 2025. For comparison, the RMSEs of the top-ranked, third-ranked, and official baseline systems are 0.364, 0.384, and 0.444, respectively.
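A hedged sketch of late score fusion evaluated with RMSE; the fusion weight and all scores are illustrative, not the paper's values:

```python
# Late score fusion: a weighted average of two systems' scores, scored by RMSE.
import numpy as np

w2v_scores = np.array([3.2, 4.1, 2.8, 3.9])   # hypothetical W2V grader outputs
mllm_scores = np.array([3.6, 3.8, 2.5, 4.2])  # hypothetical Phi-4 MLLM outputs
references = np.array([3.5, 4.0, 2.5, 4.0])   # hypothetical human scores

alpha = 0.5  # fusion weight (a tunable hyperparameter, not the paper's value)
fused = alpha * w2v_scores + (1 - alpha) * mllm_scores

rmse = np.sqrt(np.mean((fused - references) ** 2))
print(f"RMSE: {rmse:.3f}")
```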
arXiv Detail & Related papers (2025-06-05T15:09:23Z)
- Chain-of-Thought Training for Open E2E Spoken Dialogue Systems [57.77235760292348]
End-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information. We propose a chain-of-thought (CoT) formulation to ensure that training on conversational data remains closely aligned with the multimodal language model. Our method achieves an improvement of over 1.5 ROUGE-1 points over the baseline, successfully training spoken dialogue systems on publicly available human-human conversation datasets.
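ROUGE-1 measures unigram overlap between a generated response and a reference; a minimal F1 implementation with illustrative sentences:

```python
# Minimal ROUGE-1 (unigram overlap F1) between a candidate and a reference.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))  # ~0.833
```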
arXiv Detail & Related papers (2025-05-31T21:43:37Z)
- Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [84.03928547166873]
Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. The BabyLM Challenge is a communal effort in which participants compete to optimize language model training on a fixed data budget.
arXiv Detail & Related papers (2025-04-10T23:22:43Z)
- Speak & Improve Challenge 2025: Tasks and Baseline Systems [28.877872578497854]
"Speak & Improve Challenge 2025: Spoken Language Assessment and Feedback" is a challenge associated with the ISCA SLaTE 2025 Workshop.<n>The goal of the challenge is to advance research on spoken language assessment and feedback, with tasks associated with both the underlying technology and language learning feedback.<n>The challenge has four shared tasks: Automatic Speech Recognition (ASR), Spoken Language Assessment (SLA), Spoken Grammatical Error Correction (SGEC), and Spoken Grammatical Error Correction Feedback (SGECF)
arXiv Detail & Related papers (2024-12-16T17:05:18Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features: 1) speaker-regularized spectral basis embedding (SBE) features that exploit a special regularization term to enforce homogeneity of speaker features during adaptation (sketched below); and 2) feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, shown to be insensitive to the amount of speaker-level data in test-time adaptation.
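The paper's exact regularization term is not reproduced here; a hypothetical within-speaker variance penalty sketches the general idea of enforcing homogeneity of speaker-level features:

```python
# Hypothetical homogeneity penalty: mean squared distance of each utterance
# embedding to its speaker centroid (an illustration, not the paper's term).
import torch

def homogeneity_penalty(features: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
    penalty = features.new_zeros(())
    for spk in speaker_ids.unique():
        spk_feats = features[speaker_ids == spk]       # utterances of one speaker
        centroid = spk_feats.mean(dim=0, keepdim=True)
        penalty = penalty + ((spk_feats - centroid) ** 2).sum(dim=1).mean()
    return penalty / speaker_ids.unique().numel()

feats = torch.randn(8, 64)                  # 8 utterance embeddings, dim 64
spk = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(homogeneity_penalty(feats, spk))
```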
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge [2.018088271426157]
This paper describes Tallinn University of Technology (TalTech) systems developed for the ASRU MADASR 2023 Challenge.
The challenge focuses on automatic speech recognition of dialect-rich Indian languages with limited training audio and text data.
TalTech participated in two tracks of the challenge: Track 1, which allowed using only the provided training data, and Track 3, which also allowed additional audio data.
arXiv Detail & Related papers (2023-10-26T14:57:08Z)
- KIT's Multilingual Speech Translation System for IWSLT 2023 [58.5152569458259]
We describe our speech translation system for the multilingual track of IWSLT 2023.
The task requires translation into 10 languages with varying amounts of resources.
Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation.
arXiv Detail & Related papers (2023-06-08T16:13:20Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
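A toy sketch of the composite idea, with placeholder modules standing in for the pretrained speech-only and language-only components; this is not ComSL's actual architecture:

```python
# Toy composite model: speech encoder -> adapter -> text decoder -> vocab head.
import torch
import torch.nn as nn

class CompositeSpeechTranslator(nn.Module):
    def __init__(self, speech_dim=512, text_dim=768, vocab=32000):
        super().__init__()
        self.speech_encoder = nn.GRU(80, speech_dim, batch_first=True)  # placeholder
        self.adapter = nn.Linear(speech_dim, text_dim)  # bridges the two models
        self.text_decoder = nn.GRU(text_dim, text_dim, batch_first=True)  # placeholder
        self.lm_head = nn.Linear(text_dim, vocab)

    def forward(self, fbank):                      # fbank: (batch, frames, 80)
        enc, _ = self.speech_encoder(fbank)
        dec, _ = self.text_decoder(self.adapter(enc))
        return self.lm_head(dec)                   # per-frame vocab logits

model = CompositeSpeechTranslator()
print(model(torch.randn(2, 100, 80)).shape)  # torch.Size([2, 100, 32000])
```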
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches, each motivated by modeling specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z)
- Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System [19.435571932141364]
This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German.
5 hours of transcribed data and 60 hours of untranscribed data are provided to develop a German ASR system for children.
For training on the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances.
Our system achieves a word error rate (WER) of 39.68% on the evaluation data.
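WER is the word-level Levenshtein distance normalized by reference length, (S + D + I) / N; a minimal implementation:

```python
# Word error rate via dynamic-programming edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference words up to i
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words up to j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("das ist ein test", "das ist test ja"))  # 0.5
```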
arXiv Detail & Related papers (2021-06-18T07:36:26Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems from only two years ago, which were trained on 960 hours of labeled data.
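A highly simplified sketch of the adversarial mapping: a generator turns pooled segment representations into phoneme distributions, and a discriminator tries to distinguish them from phonemized text; all modules and shapes are toy stand-ins, not wav2vec-U's actual components:

```python
# Toy GAN step for unsupervised representation-to-phoneme mapping.
import torch
import torch.nn as nn

N_PHONES, FEAT_DIM = 40, 512
generator = nn.Linear(FEAT_DIM, N_PHONES)  # segment repr -> phone logits
discriminator = nn.Sequential(nn.Linear(N_PHONES, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()

segments = torch.randn(16, FEAT_DIM)  # stand-in for pooled speech segments
real_phones = torch.eye(N_PHONES)[torch.randint(N_PHONES, (16,))]  # one-hot text phones

fake_phones = generator(segments).softmax(dim=-1)
# Discriminator step: real text phones vs. generated distributions.
d_loss = bce(discriminator(real_phones), torch.ones(16, 1)) + \
         bce(discriminator(fake_phones.detach()), torch.zeros(16, 1))
# Generator step: fool the discriminator.
g_loss = bce(discriminator(fake_phones), torch.ones(16, 1))
print(d_loss.item(), g_loss.item())
```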
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
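BPE-dropout randomly skips merges at tokenization time, so the same word yields varying subword sequences, which augments the acoustic-unit inventory seen in training. A sketch using the `tokenizers` library's dropout option; corpus and dropout value are illustrative:

```python
# BPE-dropout: stochastic subword segmentation with HuggingFace tokenizers.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))  # merges skipped with p=0.1
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["merhaba nasilsin", "merhaba dunya"] * 50, trainer)

for _ in range(3):
    print(tokenizer.encode("merhaba").tokens)  # segmentation can vary run to run
```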
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Improving Proper Noun Recognition in End-to-End ASR By Customization of the MWER Loss Criterion [33.043533068435366]
Proper nouns present a challenge for end-to-end (E2E) automatic speech recognition (ASR) systems.
Unlike conventional ASR models, E2E systems lack an explicit pronunciation model that can be specifically trained with proper noun pronunciations.
This paper builds on recent advances in minimum word error rate (MWER) training to develop two new loss criteria that specifically emphasize proper noun recognition.
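An MWER-style loss minimizes the expected number of word errors over an n-best list, with hypothesis probabilities renormalized over the list and a mean-error baseline subtracted for variance reduction; a hedged sketch with hypothetical scores:

```python
# MWER-style expected word-error loss over an n-best list.
import torch

def mwer_loss(nbest_logprobs: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """nbest_logprobs: (n,) model log-scores; word_errors: (n,) per-hypothesis errors."""
    probs = torch.softmax(nbest_logprobs, dim=0)  # renormalize over the n-best list
    centered = word_errors - word_errors.mean()   # baseline subtraction
    return (probs * centered).sum()

logprobs = torch.tensor([-1.2, -1.5, -2.0], requires_grad=True)
errors = torch.tensor([0.0, 2.0, 1.0])  # hypothetical word-error counts
loss = mwer_loss(logprobs, errors)
loss.backward()
print(loss.item(), logprobs.grad)
```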
arXiv Detail & Related papers (2020-05-19T21:10:50Z)