Nonwords Pronunciation Classification in Language Development Tests for
Preschool Children
- URL: http://arxiv.org/abs/2206.08058v2
- Date: Fri, 17 Jun 2022 07:08:27 GMT
- Title: Nonwords Pronunciation Classification in Language Development Tests for
Preschool Children
- Authors: Ilja Baumann, Dominik Wagner, Sebastian Bayerl, Tobias Bocklet
- Abstract summary: This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches, each designed to model specific language structures.
- Score: 7.224391516694955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work aims to automatically evaluate whether the language development of
children is age-appropriate. Validated speech and language tests are used for
this purpose to test auditory memory. In this work, the task is to determine
whether spoken nonwords have been uttered correctly. We compare different
approaches, each designed to model specific language structures: low-level
features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings
(wav2vec 2.0), and phonetic embeddings in the form of senones (ASR acoustic
model). Each approach provides input to VGG-like 5-layer CNN classifiers. We
also examine per-nonword adaptation. The proposed systems were evaluated on
recordings of spoken nonwords collected in different kindergartens. ECAPA-TDNN
and low-level FFT features do not explicitly model phonetic information;
wav2vec 2.0 is trained on grapheme labels, while our ASR acoustic model
features contain (sub-)phonetic information. We found that the more granular
the phonetic modeling, the higher the achieved recognition rates. The best
system, trained on ASR acoustic model features with VTLN, achieved an accuracy
of 89.4% and an area under the ROC (Receiver Operating Characteristic) curve
(AUC) of 0.923. This corresponds to a relative improvement of 20.2% in accuracy
and 0.309 in AUC over the FFT baseline.
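To make the classification setup concrete, the sketch below shows a VGG-like 5-layer CNN over 2-D acoustic feature maps (e.g., FFT spectrograms, or frame-level wav2vec 2.0 / senone features laid out as feature-dimension x time) with a binary correct/incorrect output. This is a minimal PyTorch sketch under assumed channel sizes, pooling, and head; it is not the authors' exact architecture or training recipe.

```python
# Hypothetical sketch: VGG-like 5-layer CNN classifier over 2-D feature maps
# (spectrogram or embedding sequence), deciding whether a spoken nonword was
# uttered correctly. Channel progression and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class VGGLikeClassifier(nn.Module):
    def __init__(self, in_channels: int = 1, n_classes: int = 2):
        super().__init__()
        chans = [32, 64, 128, 128, 256]           # assumed channel sizes
        layers, prev = [], in_channels
        for c in chans:                           # five conv blocks, VGG style
            layers += [
                nn.Conv2d(prev, c, kernel_size=3, padding=1),
                nn.BatchNorm2d(c),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),
            ]
            prev = c
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)       # handles variable-length utterances
        self.head = nn.Linear(chans[-1], n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, feature_dim, time), e.g. an FFT spectrogram or a
        # wav2vec 2.0 / senone feature matrix for one nonword utterance
        h = self.pool(self.features(x)).flatten(1)
        return self.head(h)                       # logits: correct vs. incorrect

model = VGGLikeClassifier()
dummy = torch.randn(4, 1, 128, 200)               # 4 utterances, 128-dim features, 200 frames
print(model(dummy).shape)                         # torch.Size([4, 2])
```

Given per-utterance logits on a held-out set, accuracy and the ROC AUC reported above could be computed with sklearn.metrics.accuracy_score and roc_auc_score applied to the positive-class probabilities.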
Related papers
- Language Modelling for Speaker Diarization in Telephonic Interviews [13.851959980488529]
The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER.
The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
arXiv Detail & Related papers (2025-01-28T18:18:04Z) - Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, respectively, and word error rates that are 3.3% and 2.0% lower.
arXiv Detail & Related papers (2024-09-27T03:31:32Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - Analysing the Impact of Audio Quality on the Use of Naturalistic
Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z) - From English to More Languages: Parameter-Efficient Model Reprogramming
for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - Is Attention always needed? A Case Study on Language Identification from
Speech [1.162918464251504]
The present study introduces convolutional recurrent neural network (CRNN) based LID.
CRNN based LID is designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) characteristics of audio samples.
The LID model exhibits high performance, ranging from 97% to 100%, for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z) - Private Language Model Adaptation for Speech Recognition [15.726921748859393]
Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices.
We introduce an efficient approach on continuously adapting neural network language models (NNLMs) on private devices with applications on automatic speech recognition.
arXiv Detail & Related papers (2021-09-28T00:15:43Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - DNN-Based Semantic Model for Rescoring N-best Speech Recognition List [8.934497552812012]
The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and testing conditions, for example due to noise.
This work aims to improve ASR by modeling long-term semantic relations to compensate for distorted acoustic features.
arXiv Detail & Related papers (2020-11-02T13:50:59Z) - Data augmentation using prosody and false starts to recognize non-native
children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)