SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
- URL: http://arxiv.org/abs/2207.13703v1
- Date: Wed, 27 Jul 2022 01:14:59 GMT
- Title: SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
- Authors: Artem Ploujnikov, Mirco Ravanelli
- Abstract summary: This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
- Score: 10.016862617549991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech synthesis models directly convert the input characters into
an audio representation (e.g., spectrograms). Despite their impressive
performance, such models have difficulty disambiguating the pronunciations of
identically spelled words. To mitigate this issue, a separate
Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into
phonemes before synthesizing the audio. This paper proposes SoundChoice, a
novel G2P architecture that processes entire sentences rather than operating at
the word level. The proposed architecture takes advantage of a weighted
homograph loss (that improves disambiguation), exploits curriculum learning
(that gradually switches from word-level to sentence-level G2P), and integrates
word embeddings from BERT (for further performance improvement). Moreover, the
model inherits the best practices in speech recognition, including multi-task
learning with Connectionist Temporal Classification (CTC) and beam search with
an embedded language model. As a result, SoundChoice achieves a Phoneme Error
Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech
and Wikipedia.
Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
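The abstract does not spell out the weighted homograph loss; as a rough illustration of the idea, a sentence-level G2P loss can up-weight the phoneme positions that belong to the ambiguous word. The PyTorch sketch below is a minimal approximation under that assumption, not the paper's exact formulation; `homograph_mask` and `w_h` are hypothetical names and the weighting scheme is a guess.

```python
import torch.nn.functional as F

def weighted_homograph_loss(log_probs, targets, homograph_mask, w_h=2.0):
    """Sequence NLL with extra weight on the homograph's phoneme positions.

    log_probs:      (batch, time, n_phonemes) log-softmax outputs
    targets:        (batch, time) gold phoneme indices
    homograph_mask: (batch, time) floats, 1.0 where the target phoneme
                    belongs to the homograph word, 0.0 elsewhere
    w_h:            extra weight for homograph positions (hypothetical
                    value; the paper tunes its own weighting)
    """
    # Per-position negative log-likelihood, shape (batch, time).
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")
    # Positions inside the homograph word count w_h times as much.
    weights = 1.0 + (w_h - 1.0) * homograph_mask
    return (weights * nll).mean()
```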
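The CTC multi-task training mentioned in the abstract follows a common speech-recognition recipe: interpolate the attention decoder's negative log-likelihood with a CTC loss on the encoder outputs. A minimal sketch, assuming padded targets and a blank index of 0; the interpolation weight `alpha=0.3` is a common ASR default, not a value from the paper.

```python
import torch.nn.functional as F

def joint_ctc_attention_loss(att_log_probs, ctc_log_probs, targets,
                             input_lens, target_lens, alpha=0.3):
    """(1 - alpha) * attention NLL + alpha * CTC, as in hybrid ASR training.

    att_log_probs: (batch, target_len, n_phonemes) decoder log-softmax
    ctc_log_probs: (enc_len, batch, n_phonemes) encoder log-softmax
    targets:       (batch, target_len) padded phoneme indices
    """
    # Padding handling is omitted for brevity.
    att_nll = F.nll_loss(att_log_probs.transpose(1, 2), targets)
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    return (1.0 - alpha) * att_nll + alpha * ctc
```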
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
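The CTAP objective is not detailed in this summary; the sketch below shows a generic CLIP-style symmetric contrastive loss that pulls paired phoneme and speech embeddings together, applied at the utterance level for simplicity (CTAP itself aligns the two modalities at a finer granularity). The temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def ctap_style_contrastive_loss(phoneme_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired phoneme/speech embeddings.

    phoneme_emb, speech_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor comes from the same utterance.
    """
    p = F.normalize(phoneme_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = p @ s.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(p.size(0), device=p.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```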
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
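One straightforward reading of "learning pronunciation examples from audio recordings" is a pseudo-labelling loop: transcribe recordings of known words with a phoneme recognizer and add the resulting (spelling, phonemes) pairs to the G2P training data. This is a hypothetical sketch, not necessarily the paper's method; `phoneme_recognizer` is an assumed callable.

```python
def mine_pronunciations(words_with_audio, phoneme_recognizer):
    """Build extra G2P training pairs from audio (hypothetical sketch).

    words_with_audio:   iterable of (word, waveform) pairs
    phoneme_recognizer: assumed callable mapping a waveform to a
                        phoneme string
    """
    extra_lexicon = []
    for word, waveform in words_with_audio:
        extra_lexicon.append((word, phoneme_recognizer(waveform)))
    return extra_lexicon
```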
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion [1.5020330976600735]
Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework that first transforms input sequences into character embeddings, obtains linguistic information using language models, and then predicts the phonemes based on global context.
We propose the Reinforcer, which provides a strong inductive bias for language models by emphasizing phonological information between neighboring characters to help disambiguate pronunciations.
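The Reinforcer's exact design is not given here; as a generic illustration of a locality bias over neighboring characters, one can mix each character embedding with its immediate neighbors before feeding the language model, for example with a depth-wise convolution. A sketch under that assumption, not the paper's architecture:

```python
import torch.nn as nn

class NeighborContext(nn.Module):
    """Mixes each character embedding with its neighbors via a depth-wise
    1-D convolution; a generic locality bias, not the exact Reinforcer."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return x + y                           # residual keeps the original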
arXiv Detail & Related papers (2023-03-14T09:15:51Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
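Dict-TTS's semantics-to-pronunciation attention is more involved than this summary conveys; as a toy stand-in, polyphone disambiguation with a dictionary can be framed as matching the sentence context against the gloss of each candidate sense and taking the best-scoring pronunciation. All names below are illustrative:

```python
import torch

def select_pronunciation(context_vec, gloss_embs, prons):
    """Pick a pronunciation by matching sentence context against the
    dictionary glosses of each candidate sense (simplified stand-in).

    context_vec: (dim,) embedding of the sentence context
    gloss_embs:  (n_candidates, dim) embeddings of each sense's gloss
    prons:       list of n_candidates phoneme strings
    """
    scores = gloss_embs @ context_vec      # similarity score per sense
    return prons[int(torch.argmax(scores))]
```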
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead of relying on text, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
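The discrete targets in this line of work are commonly obtained by clustering self-supervised speech features (for example, k-means over frame-level representations). A minimal scikit-learn sketch; the codebook size of 100 is an assumption:

```python
from sklearn.cluster import MiniBatchKMeans

def train_unit_codebook(corpus_features, n_units=100):
    """Fit a k-means codebook on pooled frame features (n_frames, dim)."""
    return MiniBatchKMeans(n_clusters=n_units).fit(corpus_features)

def speech_to_units(codebook, utterance_features):
    """Map one utterance's frame features to a discrete unit ID sequence."""
    return codebook.predict(utterance_features)   # shape: (n_frames,)
```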
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects [1.3786433185027864]
Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunciations.
Dictionary-based methods typically require significant manual effort to build and have limited adaptivity to unseen words.
We propose a novel use of a transformer-based attention model that can adapt to unseen dialects of English while using a small dictionary.
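Adapting such a model to a new dialect is essentially transfer learning: start from the pretrained G2P weights and fine-tune on the small dialect-specific dictionary. A hypothetical PyTorch-style loop; `g2p_model.loss` is an assumed model API, not one from the paper.

```python
def adapt_to_dialect(g2p_model, dialect_lexicon, optimizer, epochs=5):
    """Fine-tune a pretrained G2P model on a small dialect lexicon.

    dialect_lexicon: iterable of (grapheme_seq, phoneme_seq) training pairs
    """
    g2p_model.train()
    for _ in range(epochs):
        for graphemes, phonemes in dialect_lexicon:
            loss = g2p_model.loss(graphemes, phonemes)  # assumed API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```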
arXiv Detail & Related papers (2021-04-08T21:36:21Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a voice conversion (VC) system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- RECOApy: Data recording, pre-processing and phonetic transcription for end-to-end speech-based applications [4.619541348328938]
RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications.
The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming.
The grapheme-to-phoneme (G2P) converters are deep neural network (DNN)-based architectures trained on lexicons extracted from the Wiktionary online collaborative resource.
arXiv Detail & Related papers (2020-09-11T15:26:55Z)