Improving Rare Words Recognition through Homophone Extension and Unified
Writing for Low-resource Cantonese Speech Recognition
- URL: http://arxiv.org/abs/2302.00836v1
- Date: Thu, 2 Feb 2023 02:46:32 GMT
- Authors: HoLam Chung, Junan Li, Pengfei Liu, Wai-Kim Leung, Xixin Wu, Helen
Meng
- Abstract summary: Homophone characters are common in tonal syllable-based languages, such as Mandarin and Cantonese.
This paper presents a novel homophone extension method to integrate human knowledge of the homophone lexicon into the beam search decoding process.
We also propose an automatic unified writing method to merge the variants of Cantonese characters.
- Score: 36.10245119706219
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Homophone characters are common in tonal syllable-based languages, such as
Mandarin and Cantonese. Data-intensive end-to-end Automatic Speech
Recognition (ASR) systems are more likely to mis-recognize homophone characters
and rare words under low-resource settings. For the problem of low-resource
Cantonese speech recognition, this paper presents a novel homophone extension
method to integrate human knowledge of the homophone lexicon into the beam
search decoding process with language model re-scoring. Besides, we propose an
automatic unified writing method to merge the variants of Cantonese characters
and standardize speech annotation guidelines, which enables more efficient
utilization of labeled utterances by providing more samples for the merged
characters. We empirically show that both homophone extension and unified
writing improve the recognition performance significantly on both in-domain and
out-of-domain test sets, with an absolute Character Error Rate (CER) decrease
of around 5% and 18%, respectively.
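The two ideas in the abstract can be sketched roughly as follows. The variant map, homophone lexicon, and language-model scorer below are hypothetical toy illustrations, not the paper's actual resources or decoder:

```python
# Toy sketch of the abstract's two ideas; the variant map, homophone
# lexicon, and language-model scorer are hypothetical illustrations.

# Unified writing: map variant Cantonese characters to one canonical
# form so labeled utterances for all variants are pooled together.
VARIANT_MAP = {"裡": "裏", "着": "著"}  # hypothetical variant -> canonical

def unify_writing(text: str) -> str:
    """Normalize character variants to their canonical forms."""
    return "".join(VARIANT_MAP.get(ch, ch) for ch in text)

# Homophone extension: expand each character of a beam hypothesis to
# its homophones, so a rare word whose character shares a pronunciation
# with a mis-recognized frequent character can be recovered during
# language-model re-scoring.
HOMOPHONES = {"事": ["事", "士", "是"]}  # toy lexicon (all read si6)

def extend_hypothesis(hyp: str) -> list[str]:
    """Enumerate spellings obtained by swapping homophone characters."""
    candidates = [""]
    for ch in hyp:
        options = HOMOPHONES.get(ch, [ch])
        candidates = [prefix + opt for prefix in candidates for opt in options]
    return candidates

def rescore(beam: list[str], lm_score) -> str:
    """Return the best homophone-extended hypothesis under an external LM."""
    extended = {ext for hyp in beam
                for ext in extend_hypothesis(unify_writing(hyp))}
    return max(extended, key=lm_score)
```

Note that the candidate set grows multiplicatively with each homophone character, so a practical decoder would restrict the expansion to a few positions per hypothesis rather than enumerating exhaustively as this sketch does.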
Related papers
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition
and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning in low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
arXiv Detail & Related papers (2023-12-06T06:37:24Z)
- MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and
Phonetic Domains for Speech Representation Learning [0.76146285961466]
We present a methodology for linguistic feature extraction, focusing on automatically syllabifying words in multiple languages.
In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification.
The system was built with open-source components and resources.
arXiv Detail & Related papers (2023-10-17T19:27:23Z)
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
- Differentiable Allophone Graphs for Language-Universal Speech
Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
- Spoken Term Detection Methods for Sparse Transcription in Very
Low-resource Settings [20.410074074340447]
Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection.
We show that representing phoneme recognition ambiguity in a graph structure can further boost the recall while maintaining high precision in the low resource spoken term detection task.
arXiv Detail & Related papers (2021-06-11T04:09:54Z)
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a
Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)
- Homophone-based Label Smoothing in End-to-End Automatic Speech
Recognition [8.066444614339614]
The proposed label smoothing method incorporates pronunciation knowledge of homophones into training.
Experiments with a hybrid CTC sequence-to-sequence model show that the new method can reduce character error rate (CER) by 0.4% absolute.
arXiv Detail & Related papers (2020-04-07T14:37:30Z)
- Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.