Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings
- URL: http://arxiv.org/abs/2307.16643v1
- Date: Mon, 31 Jul 2023 13:25:38 GMT
- Title: Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings
- Authors: Manuel Sam Ribeiro, Giulia Comini, Jaime Lorenzo-Trueba
- Abstract summary: The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
- Score: 12.669655363646257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these applications tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings. Our approach bootstraps a G2P model with a small set of annotated examples. The G2P model is used to train a multilingual phone recognition system, which then decodes speech recordings into a phonetic representation. Given the hypothesized phoneme labels, we learn pronunciation dictionaries for out-of-vocabulary words and use those to re-train the G2P system. Results indicate that our approach consistently improves the phone error rate of G2P systems across languages and amounts of available data.
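To make the pipeline concrete, here is a minimal, self-contained Python sketch of the bootstrapping loop. Every component is a toy stand-in: the "G2P" is a one-letter-per-phone substitution table and the phone recognizer's decodings are simulated with a fixed transcript dictionary, so only the data flow mirrors the method described above.

```python
# Toy, self-contained sketch of the bootstrapping loop from the abstract.
# All components are stand-ins, not the paper's implementation.

SEED_LEXICON = {"cat": "k ae t", "bat": "b ae t"}      # annotated examples
DECODED_SPEECH = {"mat": "m ae t", "rat": "r ae t"}    # simulated recognizer output


def train_g2p(lexicon):
    """'Train' a toy G2P: learn a per-letter phone mapping from the lexicon."""
    mapping = {}
    for word, pron in lexicon.items():
        phones = pron.split()
        if len(phones) == len(word):         # toy alignment: one letter, one phone
            for letter, phone in zip(word, phones):
                mapping[letter] = phone
    return mapping


def apply_g2p(g2p, word):
    """Convert a word with the toy G2P; '?' marks unmappable letters."""
    return " ".join(g2p.get(letter, "?") for letter in word)


def bootstrap(seed_lexicon, decoded_speech):
    lexicon = dict(seed_lexicon)
    g2p = train_g2p(lexicon)                 # 1. bootstrap on seed examples
    # 2.-3. in the paper, a multilingual phone recognizer trained with G2P
    # output decodes recordings; here those hypotheses are simply looked up
    for word, hypothesis in decoded_speech.items():
        if word not in lexicon:              # 4. learn OOV pronunciations
            lexicon[word] = hypothesis
    return train_g2p(lexicon), lexicon       # 5. re-train the G2P


g2p, lexicon = bootstrap(SEED_LEXICON, DECODED_SPEECH)
print(apply_g2p(g2p, "ram"))                 # -> 'r ae m', learned from OOV words
```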
Related papers
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech [1.1852406625172218]
We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech for low-resource languages (LRLs).
Experiments with FastSpeech 2 and the LRL West Frisian show that articulatory features outperform phone labels in both intelligibility and naturalness.
arXiv Detail & Related papers (2023-06-01T10:42:56Z)
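As a toy illustration of the input contrast studied in the paper above, the sketch below encodes phones either as opaque label IDs or as vectors of articulatory features; the tiny feature inventory is a simplified assumption, not the paper's exact feature set.

```python
# Toy contrast of the two input types: opaque phone-label IDs vs.
# articulatory feature vectors. Inventory is an illustrative assumption.

# Phone-label input: each phone is just an arbitrary ID.
PHONE_IDS = {"p": 0, "b": 1, "m": 2, "s": 3}

# Articulatory input: phones decompose into shared dimensions, so a
# phone unseen in the target language can reuse features seen elsewhere.
ARTICULATORY = {
    "p": {"voiced": 0, "nasal": 0, "place": "bilabial", "manner": "stop"},
    "b": {"voiced": 1, "nasal": 0, "place": "bilabial", "manner": "stop"},
    "m": {"voiced": 1, "nasal": 1, "place": "bilabial", "manner": "stop"},
    "s": {"voiced": 0, "nasal": 0, "place": "alveolar", "manner": "fricative"},
}

PLACES = ["bilabial", "alveolar"]
MANNERS = ["stop", "fricative"]

def phone_to_vector(phone):
    """Encode a phone as a numeric articulatory feature vector."""
    f = ARTICULATORY[phone]
    return [f["voiced"], f["nasal"],
            PLACES.index(f["place"]), MANNERS.index(f["manner"])]

print(PHONE_IDS["m"], phone_to_vector("m"))   # 2 [1, 1, 0, 0]
```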
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
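The multi-task setup above combines several objectives into one training signal. A generic sketch of such a weighted multi-task loss follows; the task names, weights, and loss values are illustrative placeholders, not MMSpeech's actual five tasks or numbers.

```python
# Generic sketch of a weighted multi-task pre-training loss, combining
# several objectives into one scalar, in the spirit of MMSpeech.

def multi_task_loss(losses, weights=None):
    """Return the weighted sum of per-task losses (plain floats here)."""
    if weights is None:
        weights = {task: 1.0 for task in losses}   # equal weighting by default
    return sum(weights[task] * value for task, value in losses.items())

step_losses = {
    "supervised_asr":     1.32,   # supervised speech-to-text objective
    "masked_speech":      0.87,   # self-supervised masked prediction
    "masked_text":        0.54,
    "speech_to_code":     0.91,
    "phoneme_prediction": 0.66,
}
print(multi_task_loss(step_losses))   # single scalar training objective
```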
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
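Phoneme Error Rate (PER), the metric quoted for SoundChoice and used throughout this literature, is the Levenshtein edit distance between hypothesis and reference phoneme sequences divided by the reference length. A small self-contained implementation with toy sequences:

```python
# PER: token-level Levenshtein distance / reference length, in percent.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over tokens."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

def per(hypothesis, reference):
    return 100.0 * edit_distance(hypothesis, reference) / len(reference)

ref = "dh ah k ae t".split()
hyp = "dh ah k ah t".split()
print(f"{per(hyp, ref):.2f}%")   # one substitution in five phones -> 20.00%
```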
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
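The dictionary-prior idea in Dict-TTS can be pictured with a deliberately simple stand-in: score each dictionary sense of a polyphonic word against the sentence context and pick the best-matching pronunciation. Dict-TTS learns this matching with semantic attention; the word-overlap scoring and the tiny dictionary below are assumptions for illustration only.

```python
# Toy illustration of dictionary-prior pronunciation selection. The
# word-overlap score stands in for Dict-TTS's learned semantic matching.

DICTIONARY = {
    "bass": [
        {"pron": "b ey s", "gloss": "low deep sound music instrument"},
        {"pron": "b ae s", "gloss": "fish freshwater species catch"},
    ],
}

def choose_pronunciation(word, context):
    """Score each dictionary entry by gloss/context word overlap."""
    context_words = set(context.lower().split())
    best = max(DICTIONARY[word],
               key=lambda e: len(context_words & set(e["gloss"].split())))
    return best["pron"]

print(choose_pronunciation("bass", "he played the bass in a music band"))
# -> 'b ey s' (the 'music' sense wins over the 'fish' sense)
```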
- r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme Conversion by Controlled noise introducing and Contextual information incorporation [32.75866643254402]
We show that neural G2P models are extremely sensitive to orthographic variations in graphemes, such as spelling mistakes.
We propose three controlled noise-introducing methods to synthesize noisy training data.
We incorporate contextual information into the baseline and propose a robust training strategy to stabilize the training process.
arXiv Detail & Related papers (2022-02-21T13:29:30Z)
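A sketch of controlled noise injection in the spirit of r-G2P's synthetic spelling perturbations: random character substitutions, deletions, and insertions applied to clean graphemes, while the pronunciations remain clean targets. The three edit operations and the 10% noise rate are illustrative assumptions, not the paper's exact procedures.

```python
# Controlled grapheme noise for robustness training (illustrative only).

import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def perturb(word, rate=0.1, rng=random):
    """Randomly substitute, delete, or insert characters in a word."""
    chars = []
    for ch in word:
        r = rng.random()
        if r < rate:                        # substitution
            chars.append(rng.choice(ALPHABET))
        elif r < 2 * rate:                  # deletion: skip this character
            continue
        else:
            chars.append(ch)
        if rng.random() < rate:             # insertion after this position
            chars.append(rng.choice(ALPHABET))
    return "".join(chars)

rng = random.Random(0)                      # seeded for reproducibility
clean = "pronunciation"
noisy = [perturb(clean, rng=rng) for _ in range(3)]
print(noisy)   # noisy spellings; the clean pronunciation stays the target
```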
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
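The interpretable phone-to-phoneme mappings described above can be pictured as per-phone probability distributions over the phonemes each universal phone is allowed to realize. In the paper these weights are learned end-to-end with the recognizer; the sketch below just softmax-normalizes fixed toy scores.

```python
# Probabilistic phone-to-phoneme mappings as per-phone distributions.
# The allowed mappings and raw scores below are fixed toy numbers.

import math

RAW_WEIGHTS = {
    "t":  {"/t/": 2.0, "/d/": 0.1},   # [t] is usually /t/, rarely /d/
    "th": {"/t/": 1.0},               # aspirated [th] always maps to /t/
}

def mapping_probabilities(raw):
    """Softmax-normalize raw scores into per-phone distributions."""
    probs = {}
    for phone, scores in raw.items():
        z = sum(math.exp(v) for v in scores.values())
        probs[phone] = {ph: math.exp(v) / z for ph, v in scores.items()}
    return probs

for phone, dist in mapping_probabilities(RAW_WEIGHTS).items():
    print(phone, {ph: round(p, 3) for ph, p in dist.items()})
# t {'/t/': 0.87, '/d/': 0.13}
# th {'/t/': 1.0}
```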
- Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects [1.3786433185027864]
Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunciations.
Dictionary-based methods usually require significant manual effort to build and have limited adaptability to unseen words.
We propose a novel use of a transformer-based attention model that can adapt to unseen dialects of English while using a small dictionary.
arXiv Detail & Related papers (2021-04-08T21:36:21Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics-based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [13.543705472805431]
We present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages.
We show a 7.2% average improvement in phoneme error rate for low-resource languages and no degradation for high-resource ones, compared with monolingual baselines.
arXiv Detail & Related papers (2020-06-25T06:16:29Z)
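A common way to share one encoder and decoder across languages, and a plausible reading of the setup above, is to prepend a language token to the grapheme sequence; the token format below is a conventional assumption, not necessarily the paper's exact scheme.

```python
# Language-token conditioning for a shared multilingual G2P model.
# The '<lang>' token convention is an assumption for illustration.

def make_g2p_input(word, language):
    """Prefix the grapheme sequence with a language ID token."""
    return [f"<{language}>"] + list(word)

print(make_g2p_input("hello", "en"))  # ['<en>', 'h', 'e', 'l', 'l', 'o']
print(make_g2p_input("hallo", "de"))  # ['<de>', 'h', 'a', 'l', 'l', 'o']
```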