Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech
- URL: http://arxiv.org/abs/2401.10465v1
- Date: Fri, 19 Jan 2024 03:37:27 GMT
- Title: Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech
- Authors: Abhinav Garg, Jiyeon Kim, Sushil Khyalia, Chanwoo Kim, Dhananjaya Gowda
- Abstract summary: Grapheme-to-Phoneme (G2P) is an essential first step in any modern, high-quality Text-to-Speech (TTS) system.
Most of the current G2P systems rely on carefully hand-crafted lexicons developed by experts.
We show that our data-driven, lexicon-free method performs as well as, or even marginally better than, conventional rule-based or lexicon-based neural G2Ps.
- Score: 11.76320241588959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grapheme-to-Phoneme (G2P) is an essential first step in any modern,
high-quality Text-to-Speech (TTS) system. Most of the current G2P systems rely
on carefully hand-crafted lexicons developed by experts. This poses a two-fold
problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which may not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an
expert lexicon are very high. In this paper, we eliminate both of these issues
by using recent advances in self-supervised learning to obtain data-driven
phoneme representations instead of fixed representations. We compare our
lexicon-free approach against strong baselines that utilize a well-crafted
lexicon. Furthermore, we show that our data-driven, lexicon-free method performs as well as, or even marginally better than, conventional rule-based or lexicon-based neural G2Ps in terms of Mean Opinion Score (MOS), while using no prior language lexicon or phoneme set, i.e., no linguistic expertise.
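The abstract names the core idea but not a concrete recipe. A common way to obtain data-driven, phoneme-like representations from a self-supervised speech model is to cluster its frame-level features and collapse repeated cluster IDs into a discrete unit sequence. The sketch below assumes that recipe, with scikit-learn k-means over placeholder features; the encoder, feature dimension, and unit count are illustrative stand-ins, not the authors' settings.

```python
# Minimal sketch: deriving data-driven "pseudo-phoneme" units from
# self-supervised speech features, in the spirit of a lexicon-free G2P.
# The feature source, dimension, and number of units are placeholders.
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_inventory(frame_features: np.ndarray, k: int = 100) -> KMeans:
    """Cluster frame-level SSL features (n_frames x dim) into k discrete units."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_features)

def to_unit_sequence(frame_features: np.ndarray, km: KMeans) -> list[int]:
    """Assign each frame to a unit, then collapse consecutive repeats,
    yielding a phoneme-like discrete sequence with no hand-crafted phoneme set."""
    ids = km.predict(frame_features)
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]

# Toy usage with random features standing in for a real SSL encoder's output.
feats = np.random.randn(500, 768).astype(np.float32)
km = learn_unit_inventory(feats, k=50)
print(to_unit_sequence(feats[:20], km))
```

A G2P or TTS model can then be trained to map graphemes to such unit sequences directly, which is the sense in which the method needs no expert-built lexicon or fixed phoneme set.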
Related papers
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Grammar Induction from Visual, Speech and Text [91.98797120799227]
This work introduces a novel visual-audio-text grammar induction task (VAT-GI).
Inspired by the fact that language grammar exists beyond text, we argue that text need not be the predominant modality in grammar induction.
We propose a visual-audio-text inside-outside autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing.
arXiv Detail & Related papers (2024-10-01T02:24:18Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
This model surpasses previous unsupervised ASR models in the lexicon-free setting. A toy sketch of the masked token-infilling objective appears after this entry's link.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
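The masked token-infilling objective can be illustrated on any discrete token sequence, speech units or text alike. The helper below builds a toy input/target pair; the mask rate, mask symbol, and function name are illustrative assumptions, not the paper's configuration.

```python
# Toy illustration of a masked token-infilling training pair over a
# discrete token sequence (speech units or text). Mask rate and the
# MASK symbol are illustrative, not the paper's configuration.
import random

MASK = "<mask>"

def make_infilling_pair(tokens: list[str], mask_rate: float = 0.3, seed: int = 0):
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)   # model sees a blank...
            targets.append(tok)   # ...and must infill the original token
        else:
            inputs.append(tok)
            targets.append("-")   # no loss on unmasked positions
    return inputs, targets

inp, tgt = make_infilling_pair("the cat sat on the mat".split())
print(inp)  # original tokens with some positions replaced by <mask>
print(tgt)  # the held-out tokens the model must infill at those positions
```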
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a Text-to-Speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task that is typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech [1.1852406625172218]
We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech for low-resource languages (LRLs).
Experiments with FastSpeech 2 and the LRL West Frisian show that articulatory features outperform phone labels in both intelligibility and naturalness. A sketch of what an articulatory-feature input looks like follows this entry's link.
arXiv Detail & Related papers (2023-06-01T10:42:56Z)
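To make the phone-label versus articulatory-feature contrast concrete, here is a minimal sketch; the feature inventory below is a simplified stand-in, not the feature set used in the paper.

```python
# Illustrative contrast between an atomic phone label and an
# articulatory-feature input. The feature inventory is a simplified
# stand-in, not the paper's feature set.
PHONE_TO_ARTICULATORY = {
    # phone: (voiced, place, manner)
    "p": (0, "bilabial", "plosive"),
    "b": (1, "bilabial", "plosive"),
    "s": (0, "alveolar", "fricative"),
    "z": (1, "alveolar", "fricative"),
    "m": (1, "bilabial", "nasal"),
}

def phones_to_features(phones: list[str]) -> list[tuple]:
    """Map each phone label to its articulatory feature tuple; unseen
    phones raise a KeyError, mirroring the coverage problem of lexicons."""
    return [PHONE_TO_ARTICULATORY[p] for p in phones]

print(phones_to_features(["b", "s", "m"]))
```

Because phones across languages share articulatory features, feature inputs give a transfer-learning model an overlap that atomic labels lack; that is one common explanation for cross-lingual gains of the kind reported above.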
- Good Neighbors Are All You Need for Chinese Grapheme-to-Phoneme Conversion [1.5020330976600735]
Most Chinese Grapheme-to-Phoneme (G2P) systems employ a three-stage framework: input sequences are first transformed into character embeddings, linguistic information is obtained using language models, and phonemes are then predicted from the global context.
We propose the Reinforcer, which provides a strong inductive bias for language models by emphasizing the phonological information between neighboring characters to help disambiguate pronunciations; the sketch after this entry's link illustrates the neighbor-context intuition.
arXiv Detail & Related papers (2023-03-14T09:15:51Z)
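A toy view of why neighboring characters matter for Chinese G2P, assuming a simple character-window extractor. This is illustrative only; the Reinforcer itself is a learned model, not a window function.

```python
# Sketch of the intuition behind neighbor-aware disambiguation: the
# same Chinese character can have different pronunciations depending
# on its neighbors. The window extraction below is illustrative only.
def neighbor_windows(chars: str, radius: int = 1) -> list[str]:
    """Return, for each character, a window of its neighbors that a
    model could attend to when predicting that character's phoneme."""
    out = []
    for i in range(len(chars)):
        lo, hi = max(0, i - radius), min(len(chars), i + radius + 1)
        out.append(chars[lo:hi])
    return out

# The character 行 reads 'hang2' in 银行 (bank) but 'xing2' in 行人
# (pedestrian); its neighbors carry the disambiguating signal.
print(neighbor_windows("银行"))  # windows around each character
```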
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework comprising five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement over other pre-training methods (see the worked example after this entry's link).
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
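For readers unused to "relative improvement" figures like the 40% above, a short worked example follows. The error rates are invented for illustration; they are not numbers from the MMSpeech paper.

```python
# Worked example of "relative improvement" as used in ASR reporting.
# The error rates below are made up for illustration only.
baseline_cer = 5.0   # hypothetical baseline character error rate (%)
improved_cer = 2.9   # hypothetical improved system (%)

absolute = baseline_cer - improved_cer      # 2.1 percentage points
relative = absolute / baseline_cer * 100    # 42% relative improvement
print(f"absolute: {absolute:.1f} pts, relative: {relative:.0f}%")
```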
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia; a minimal PER computation is sketched after this entry's link.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
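The PER figure above is the standard metric: token-level edit distance between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch, with illustrative ARPABET-style symbols:

```python
# Minimal sketch of Phoneme Error Rate (PER): Levenshtein edit distance
# between hypothesis and reference phoneme sequences, normalized by the
# reference length. The phoneme symbols below are illustrative.
def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    # Classic dynamic-programming edit distance over phoneme tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "HH AH L OW".split()
hyp = "HH AX L OW".split()
print(f"PER = {phoneme_error_rate(ref, hyp):.2%}")  # one substitution in four
```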
- Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models [69.35569554213679]
In Spoken Language Understanding (SLU), the task is to extract important information, such as intents and entities, from spoken commands.
This paper presents a simple method for embedding intents and entities into Finite State Transducers; a toy transducer of this kind is sketched after this entry's link.
arXiv Detail & Related papers (2022-06-29T12:49:53Z)
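As a rough intuition for embedding intents and entities in a transducer, here is a toy dictionary-backed FST. The grammar, states, and output tags are invented for illustration and do not reflect Finstreder's actual transducers.

```python
# Tiny illustration of embedding intents and entities in a finite-state
# transducer: states are dict keys, and each arc maps an input word to
# (next_state, output_tag). The grammar is invented for illustration.
FST = {
    "start":  {"turn": ("turn", "")},
    "turn":   {"on":  ("device", "intent=switch_on"),
               "off": ("device", "intent=switch_off")},
    "device": {"light": ("end", "entity=light"),
               "fan":   ("end", "entity=fan")},
}

def transduce(words: list[str]) -> list[str]:
    state, outputs = "start", []
    for w in words:
        if state not in FST or w not in FST[state]:
            raise ValueError(f"no arc for {w!r} in state {state!r}")
        state, out = FST[state][w]
        if out:  # empty outputs are epsilon arcs: consume input, emit nothing
            outputs.append(out)
    return outputs

print(transduce("turn on light".split()))  # ['intent=switch_on', 'entity=light']
```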
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-Speech (TTS) systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)