SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
- URL: http://arxiv.org/abs/2207.13703v1
- Date: Wed, 27 Jul 2022 01:14:59 GMT
- Title: SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
- Authors: Artem Ploujnikov, Mirco Ravanelli
- Abstract summary: This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
- Score: 10.016862617549991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech synthesis models directly convert the input characters into
an audio representation (e.g., spectrograms). Despite their impressive
performance, such models have difficulty disambiguating the pronunciations of
identically spelled words. To mitigate this issue, a separate
Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into
phonemes before synthesizing the audio. This paper proposes SoundChoice, a
novel G2P architecture that processes entire sentences rather than operating at
the word level. The proposed architecture takes advantage of a weighted
homograph loss (that improves disambiguation), exploits curriculum learning
(that gradually switches from word-level to sentence-level G2P), and integrates
word embeddings from BERT (for further performance improvement). Moreover, the
model inherits the best practices in speech recognition, including multi-task
learning with Connectionist Temporal Classification (CTC) and beam search with
an embedded language model. As a result, SoundChoice achieves a Phoneme Error
Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech
and Wikipedia.
Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
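The abstract does not spell out the weighted homograph loss; as a rough illustration of the idea, a sentence-level G2P loss can up-weight the phoneme positions that belong to the ambiguous word. The PyTorch sketch below is a minimal approximation under that assumption, not the paper's exact formulation; `homograph_mask` and `w_h` are hypothetical names and the weighting scheme is a guess.

```python
import torch.nn.functional as F

def weighted_homograph_loss(log_probs, targets, homograph_mask, w_h=2.0):
    """Sequence NLL with extra weight on the homograph's phoneme positions.

    log_probs:      (batch, time, n_phonemes) log-softmax outputs
    targets:        (batch, time) gold phoneme indices
    homograph_mask: (batch, time) floats, 1.0 where the target phoneme
                    belongs to the homograph word, 0.0 elsewhere
    w_h:            extra weight for homograph positions (hypothetical
                    value; the paper tunes its own weighting)
    """
    # Per-position negative log-likelihood, shape (batch, time).
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")
    # Positions inside the homograph word count w_h times as much.
    weights = 1.0 + (w_h - 1.0) * homograph_mask
    return (weights * nll).mean()
```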
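The CTC multi-task training mentioned in the abstract follows a common speech-recognition recipe: interpolate the attention decoder's negative log-likelihood with a CTC loss on the encoder outputs. A minimal sketch, assuming padded targets and a blank index of 0; the interpolation weight `alpha=0.3` is a common ASR default, not a value from the paper.

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(att_log_probs, ctc_log_probs, targets,
                             input_lens, target_lens, alpha=0.3):
    """(1 - alpha) * attention NLL + alpha * CTC, as in hybrid ASR training.

    att_log_probs: (batch, target_len, n_phonemes) decoder log-softmax
    ctc_log_probs: (enc_len, batch, n_phonemes) encoder log-softmax
    targets:       (batch, target_len) padded phoneme indices
    """
    # Padding handling is omitted for brevity.
    att_nll = F.nll_loss(att_log_probs.transpose(1, 2), targets)
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    return (1.0 - alpha) * att_nll + alpha * ctc
```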
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
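The CTAP objective is not detailed in this summary; the sketch below shows a generic CLIP-style symmetric contrastive loss that pulls paired phoneme and speech embeddings together, applied at the utterance level for simplicity (CTAP itself aligns the two modalities at a finer granularity). The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def ctap_style_contrastive_loss(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired phoneme/speech embeddings.

    phoneme_emb, speech_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same utterance.
    """
    p = F.normalize(phoneme_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = p @ s.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(p.size(0), device=p.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```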
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
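One straightforward reading of "learning pronunciation examples from audio recordings" is a pseudo-labelling loop: transcribe recordings of known words with a phoneme recognizer and add the resulting (spelling, phonemes) pairs to the G2P training data. This is a hypothetical sketch, not necessarily the paper's method; `phoneme_recognizer` is an assumed callable.

```python
def mine_pronunciations(words_with_audio, phoneme_recognizer):
    """Build extra G2P training pairs from audio (hypothetical sketch).

    words_with_audio:   iterable of (word, waveform) pairs
    phoneme_recognizer: assumed callable mapping a waveform to a
                        phoneme string
    """
    extra_lexicon = []
    for word, waveform in words_with_audio:
        extra_lexicon.append((word, phoneme_recognizer(waveform)))
    return extra_lexicon
```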
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion [1.5020330976600735]
Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework that first transforms input sequences into character embeddings, obtains linguistic information using language models, and then predicts the phonemes based on global context.
We propose the Reinforcer, which provides a strong inductive bias for language models by emphasizing phonological information between neighboring characters to help disambiguate pronunciations.
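The Reinforcer's exact design is not given here; as a generic illustration of a locality bias over neighboring characters, one can mix each character embedding with its immediate neighbors before feeding the language model, for example with a depth-wise convolution. A sketch under that assumption, not the paper's architecture:

```python
import torch.nn as nn

class NeighborContext(nn.Module):
    """Mixes each character embedding with its neighbors via a depth-wise
    1-D convolution; a generic locality bias, not the exact Reinforcer."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + y                           # residual keeps the original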
arXiv Detail & Related papers (2023-03-14T09:15:51Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
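Dict-TTS's semantics-to-pronunciation attention is more involved than this summary conveys; as a toy stand-in, polyphone disambiguation with a dictionary can be framed as matching the sentence context against the gloss of each candidate sense and taking the best-scoring pronunciation. All names below are illustrative:

```python
import torch

def select_pronunciation(context_vec, gloss_embs, prons):
    """Pick a pronunciation by matching sentence context against the
    dictionary glosses of each candidate sense (simplified stand-in).

    context_vec: (dim,) embedding of the sentence context
    gloss_embs:  (n_candidates, dim) embeddings of each sense's gloss
    prons:       list of n_candidates phoneme strings
    """
    scores = gloss_embs @ context_vec      # similarity score per sense
    return prons[int(torch.argmax(scores))]
```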
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead of relying on text, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
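The discrete targets in this line of work are commonly obtained by clustering self-supervised speech features (for example, k-means over frame-level representations). A minimal scikit-learn sketch; the codebook size of 100 is an assumption:

```python
from sklearn.cluster import MiniBatchKMeans

def train_unit_codebook(corpus_features, n_units=100):
    """Fit a k-means codebook on pooled frame features (n_frames, dim)."""
    return MiniBatchKMeans(n_clusters=n_units).fit(corpus_features)

def speech_to_units(codebook, utterance_features):
    """Map one utterance's frame features to a discrete unit ID sequence."""
    return codebook.predict(utterance_features)   # shape: (n_frames,)
```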
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects [1.3786433185027864]
Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunciations.
Dictionary-based methods typically require significant manual effort to build and have limited adaptivity to unseen words.
We propose a novel use of a transformer-based attention model that can adapt to unseen dialects of English while using a small dictionary.
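Adapting such a model to a new dialect is essentially transfer learning: start from the pretrained G2P weights and fine-tune on the small dialect-specific dictionary. A hypothetical PyTorch-style loop; `g2p_model.loss` is an assumed model API, not one from the paper.

```python
def adapt_to_dialect(g2p_model, dialect_lexicon, optimizer, epochs=5):
    """Fine-tune a pretrained G2P model on a small dialect lexicon.

    dialect_lexicon: iterable of (grapheme_seq, phoneme_seq) training pairs
    """
    g2p_model.train()
    for _ in range(epochs):
        for graphemes, phonemes in dialect_lexicon:
            loss = g2p_model.loss(graphemes, phonemes)  # assumed API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```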
arXiv Detail & Related papers (2021-04-08T21:36:21Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a voice conversion (VC) system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- RECOApy: Data recording, pre-processing and phonetic transcription for end-to-end speech-based applications [4.619541348328938]
RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications.
The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming.
The grapheme-to-phoneme (G2P) converters are deep neural network (DNN)-based architectures trained on lexicons extracted from the Wiktionary online collaborative resource.
arXiv Detail & Related papers (2020-09-11T15:26:55Z)