Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives
 - URL: http://arxiv.org/abs/2501.06478v1
 - Date: Sat, 11 Jan 2025 08:11:09 GMT
 - Title: Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives
 - Authors: Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper
 - Abstract summary: We develop automatic speech recognition systems for stories told by Afrikaans and isiXhosa preschool children. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting. 
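The parameter-efficient fine-tuning the abstract reports (helpful for Afrikaans, less so for isiXhosa) is commonly realised as low-rank adaptation (LoRA), where a frozen pretrained weight matrix is augmented with a small trainable low-rank update. The paper does not specify its exact method, so the following is only a minimal NumPy sketch of the general idea; all names are hypothetical and it is not the authors' code.

```python
import numpy as np

class LoRALinear:
    """Sketch of low-rank adaptation for a single linear layer.

    The pretrained weight W stays frozen; only the low-rank factors
    A and B are trained, i.e. rank * (d_in + d_out) parameters
    instead of d_in * d_out.
    """

    def __init__(self, W, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                   # frozen, shape (d_out, d_in)
        d_out, d_in = W.shape
        self.A = rng.normal(0.0, 0.02, (rank, d_in)) # trainable
        self.B = np.zeros((d_out, rank))             # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen projection plus the scaled low-rank correction B @ A.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

d_in, d_out = 16, 8
W = np.random.default_rng(1).normal(size=(d_out, d_in))
layer = LoRALinear(W, rank=2)
x = np.ones((3, d_in))
out = layer(x)

# Because B starts at zero, the adapted layer initially reproduces
# the frozen layer exactly; fine-tuning then moves only A and B.
assert np.allclose(out, x @ W.T)
```

In a real setup such adapters would be inserted into the attention projections of a model like Whisper, so that the handful of transcribed child-speech minutes only has to update the small adapter matrices.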
 
       
      
        Related papers
        - Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv  Detail & Related papers  (2024-01-18T08:46:02Z)
- Adaptation of Whisper models to child speech recognition [3.2548794659022398]
We show that finetuning Whisper on child speech yields significant improvements in ASR performance on child speech.
Utilizing self-supervised Wav2vec2 models that have been finetuned on child speech outperforms Whisper finetuning.
arXiv  Detail & Related papers  (2023-07-24T12:54:45Z)
- Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications [18.849741353784328]
We assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI.
We evaluate their performance on read and extemporaneous speech of native and non-native Dutch children.
arXiv  Detail & Related papers  (2023-06-29T06:14:26Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv  Detail & Related papers  (2023-06-22T14:37:54Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv  Detail & Related papers  (2023-03-14T17:05:08Z)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv  Detail & Related papers  (2023-03-09T14:58:29Z)
- Improving Children's Speech Recognition by Fine-tuning Self-supervised Adult Speech Representations [2.2191297646252646]
Children's speech recognition is a vital, yet largely overlooked domain when building inclusive speech technologies.
Recent advances in self-supervised learning have created a new opportunity for overcoming this problem of data scarcity.
We leverage self-supervised adult speech representations and use three well-known child speech corpora to build models for children's speech recognition.
arXiv  Detail & Related papers  (2022-11-14T22:03:36Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv  Detail & Related papers  (2022-09-30T09:12:10Z)
- Transfer Learning for Robust Low-Resource Children's Speech ASR with Transformers and Source-Filter Warping [11.584388304271029]
We propose a data augmentation technique based on the source-filter model of speech to close the domain gap between adult and children's speech.
Using this augmentation strategy, we apply transfer learning on a Transformer model pre-trained on adult data.
This model follows the recently introduced XLS-R architecture, a wav2vec 2.0 model pre-trained on several cross-lingual adult speech corpora.
arXiv  Detail & Related papers  (2022-06-19T12:57:47Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv  Detail & Related papers  (2022-03-29T11:55:30Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv  Detail & Related papers  (2021-10-11T00:08:48Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv  Detail & Related papers  (2021-01-19T12:53:43Z)
- Learning to Understand Child-directed and Adult-directed Speech [18.29692441616062]
Human language acquisition research indicates that child-directed speech helps language learners.
We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS).
We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better.
arXiv  Detail & Related papers  (2020-05-06T10:47:02Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     
           This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.