LLM-based phoneme-to-grapheme for phoneme-based speech recognition
- URL: http://arxiv.org/abs/2506.04711v1
- Date: Thu, 05 Jun 2025 07:35:55 GMT
- Title: LLM-based phoneme-to-grapheme for phoneme-based speech recognition
- Authors: Te Ma, Min Bi, Saierdaer Yusuyin, Hao Huang, Zhijian Ou
- Abstract summary: We propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based automatic speech recognition (ASR). Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.
- Score: 11.552927239284582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning are attractive for their high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there seems to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, by relative WER reductions of 3.6% and 6.9% respectively.
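The abstract names three quantities that a small sketch may make more concrete: a relative WER reduction, noisy phoneme inputs for P2G training, and a score marginalized over the top-K phoneme hypotheses. The Python below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not say how DANP generates its noise or how TKM randomizes and weights hypotheses, so the uniform substitute/delete/insert noise model, the renormalization of S2P scores over the K hypotheses, and all function names here are hypothetical.

```python
import math
import random


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, e.g. 10.0 -> 9.0 is a 10% reduction."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def add_phoneme_noise(phonemes, inventory, error_rate, rng):
    """Corrupt a phoneme sequence with random edits.

    ASSUMPTION: a uniform substitute/delete/insert model stands in for
    whatever noise source the paper's DANP strategy actually uses.
    """
    noisy = []
    for p in phonemes:
        r = rng.random()
        if r < error_rate / 3:
            continue  # deletion: drop this phoneme
        elif r < 2 * error_rate / 3:
            # substitution (may occasionally pick the same symbol again)
            noisy.append(rng.choice(inventory))
        elif r < error_rate:
            # insertion: keep the phoneme, then add a random one
            noisy.append(p)
            noisy.append(rng.choice(inventory))
        else:
            noisy.append(p)  # unchanged
    return noisy


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def topk_marginal_log_likelihood(s2p_log_probs, p2g_log_probs):
    """log sum_k p(h_k | x) * p(y | h_k) over K phoneme hypotheses h_k.

    ASSUMPTION: S2P scores are renormalized over the K hypotheses so that
    they sum to one; the paper may weight hypotheses differently.
    """
    norm = logsumexp(s2p_log_probs)
    joint = [a - norm + b for a, b in zip(s2p_log_probs, p2g_log_probs)]
    return logsumexp(joint)


if __name__ == "__main__":
    rng = random.Random(0)
    inventory = ["a", "e", "k", "s", "t"]  # toy phoneme set, not from the paper
    print(add_phoneme_noise(["k", "a", "t"], inventory, 0.3, rng))
    print(topk_marginal_log_likelihood(
        [math.log(0.6), math.log(0.4)],   # toy S2P hypothesis scores
        [math.log(0.2), math.log(0.5)],   # toy P2G scores for one transcript
    ))
```

Summing over several phoneme hypotheses, rather than committing to the single best one, is one way to let the P2G stage recover from S2P errors, which is the information-loss concern the abstract raises.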
Related papers
- Pronunciation-Lexicon Free Training for Phoneme-based Crosslingual ASR via Joint Stochastic Approximation [12.39451124683428]
We propose a latent variable model based method, with phonemes being treated as discrete latent variables. Based on a multilingual pre-trained S2P model, crosslingual experiments are conducted in Polish and Indonesian. With only 10 minutes of phoneme supervision, the new method, JSA-SPG, achieves 5% error rate reductions.
arXiv Detail & Related papers (2025-07-04T12:23:22Z)
- Zero-resource Speech Translation and Recognition with LLMs [38.11535502039386]
We propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM.
arXiv Detail & Related papers (2024-12-24T17:37:11Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study [2.8948274245812327]
Grapheme-to-phoneme (G2P) conversion is critical in speech processing.
Large language models (LLMs) have recently demonstrated significant potential in various language tasks.
We present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language.
arXiv Detail & Related papers (2024-09-13T06:13:55Z)
- Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning in low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
arXiv Detail & Related papers (2023-12-06T06:37:24Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement [23.933935913913043]
We propose a phone-fortified perceptual loss (PFPL) that takes phonetic information into account for training SE models.
To effectively incorporate the phonetic information, the PFPL is computed based on latent representations of the wav2vec model.
Our experimental results first reveal that the PFPL is more correlated with perceptual evaluation metrics than signal-level losses are.
arXiv Detail & Related papers (2020-10-28T18:34:28Z)
- Decoupling Pronunciation and Language for End-to-end Code-switching Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model to use monolingual paired data and unpaired text data.
The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the E2E model's heavy dependency on code-switching paired training data.
arXiv Detail & Related papers (2020-10-28T07:46:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.