r-G2P: Evaluating and Enhancing Robustness of Grapheme to Phoneme
Conversion by Controlled noise introducing and Contextual information
incorporation
- URL: http://arxiv.org/abs/2202.11194v1
- Date: Mon, 21 Feb 2022 13:29:30 GMT
- Authors: Chendong Zhao, Jianzong Wang, Xiaoyang Qu, Haoqian Wang, Jing Xiao
- Abstract summary: We show that neural G2P models are extremely sensitive to orthographic variations in graphemes, such as spelling mistakes.
We propose three controlled noise-introducing methods to synthesize noisy training data.
We incorporate contextual information into the baseline and propose a robust training strategy to stabilize the training process.
- Score: 32.75866643254402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grapheme-to-phoneme (G2P) conversion is the process of converting the written
form of words to their pronunciations. It plays an important role in
text-to-speech (TTS) synthesis and automatic speech recognition (ASR) systems.
In this paper, we aim to evaluate and enhance the robustness of G2P models. We
show that neural G2P models are extremely sensitive to orthographic
variations in graphemes, such as spelling mistakes. To address this problem, we
propose three controlled noise-introducing methods to synthesize noisy training
data. Moreover, we incorporate contextual information into the baseline and
propose a robust training strategy to stabilize the training process. The
experimental results demonstrate that our proposed robust G2P model (r-G2P)
significantly outperforms the baseline (a 2.73% WER reduction on Dict-based
benchmarks and a 9.09% WER reduction on Real-world sources).
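The paper's three noise-introducing methods are only named, not specified, in this summary. As a rough illustration of the general idea, the sketch below injects spelling-mistake-style noise into grapheme sequences through four common character-level edits; the function name, the edit set, and the tiny keyboard-neighbor map are illustrative assumptions, not the paper's actual procedure.

```python
import random

# Tiny illustrative subset of a QWERTY adjacency map (assumption).
KEYBOARD_NEIGHBORS = {
    "a": "qws", "e": "wrd", "o": "ipl", "t": "ryg", "n": "bhm",
}

def perturb_word(word, noise_prob=0.1, seed=None):
    """Inject spelling-mistake-style noise into a grapheme sequence.

    Each character is left intact with probability 1 - noise_prob;
    otherwise one of four edits is applied: substitution (keyboard
    neighbor), deletion, insertion, or transposition.
    """
    rng = random.Random(seed)
    chars = list(word)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() >= noise_prob:
            out.append(c)
            i += 1
            continue
        edit = rng.choice(["substitute", "delete", "insert", "transpose"])
        if edit == "substitute":
            out.append(rng.choice(KEYBOARD_NEIGHBORS.get(c, c)))
            i += 1
        elif edit == "delete":
            i += 1  # drop the character entirely
        elif edit == "insert":
            out.extend([c, rng.choice("aeiou")])  # spurious extra letter
            i += 1
        elif i + 1 < len(chars):  # transpose with the next character
            out.extend([chars[i + 1], c])
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)

print(perturb_word("pronunciation", noise_prob=0.2, seed=3))
```

Pairing noisy graphemes produced this way with the original clean phoneme sequences yields training data that pushes the model to normalize orthographic variation rather than copy it.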
Related papers
- LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study [2.8948274245812327]
Grapheme-to-phoneme (G2P) conversion is critical in speech processing.
Large language models (LLMs) have recently demonstrated significant potential in various language tasks.
We present a benchmarking dataset designed to assess G2P performance on sentence-level phonetic challenges of the Persian language.
arXiv Detail & Related papers (2024-09-13T06:13:55Z)
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
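The splicing step above is simple enough to sketch. Assuming word-level segments have already been cut from monolingual corpora (e.g., via forced alignment), a code-switched utterance is just a concatenation; `segment_index`, the silence gap, and the toy waveforms are assumptions for illustration, not the Speech Collage implementation.

```python
import numpy as np

def collage_utterance(words, segment_index, sr=16000, gap_ms=20):
    """Splice per-word audio segments into one code-switched utterance.

    words:         mixed-language word sequence
    segment_index: hypothetical dict mapping each word to a 1-D float32
                   waveform cut from a monolingual corpus
    """
    gap = np.zeros(int(sr * gap_ms / 1000), dtype=np.float32)  # short pause
    pieces = []
    for w in words:
        pieces.extend([segment_index[w], gap])
    return np.concatenate(pieces[:-1])  # drop the trailing gap

# Toy index: random noise stands in for real aligned segments.
rng = np.random.default_rng(0)
index = {w: rng.standard_normal(8000).astype(np.float32)
         for w in ["ich", "brauche", "coffee"]}
print(collage_utterance(["ich", "brauche", "coffee"], index).shape)
```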
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
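The target switch described above can be shown with toy tensors: besides each view predicting its own quantized latents, the clean and noisy targets are exchanged, which rewards noise-invariant representations. Random tensors and a generic InfoNCE-style loss stand in for the wav2vec 2.0 context network, quantizer, and negative sampling here; this is a conceptual sketch, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss: each context frame must match its own target
    frame against all other frames in the utterance (the distractors)."""
    sim = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    labels = torch.arange(context.size(0))
    return F.cross_entropy(sim / temperature, labels)

T, D = 50, 256                    # frames x feature dim (toy sizes)
ctx_orig = torch.randn(T, D)      # context output for the clean input
ctx_noisy = torch.randn(T, D)     # context output for the noisy input
q_orig = torch.randn(T, D)        # quantized latents, clean input
q_noisy = torch.randn(T, D)       # quantized latents, noisy input

# Standard task: each view predicts its own quantized targets ...
loss = contrastive_loss(ctx_orig, q_orig) + contrastive_loss(ctx_noisy, q_noisy)
# ... plus the switched task: clean context predicts noisy targets and
# vice versa, so both views must agree despite the added noise.
loss = loss + contrastive_loss(ctx_orig, q_noisy) + contrastive_loss(ctx_noisy, q_orig)
print(loss.item())
```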
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
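Model reprogramming of this kind reduces to training a small input transformation and a label remapping around a frozen backbone. The sketch below follows that general recipe with a stand-in "acoustic model" and a linear head in place of Voice2Series's actual transforms and many-to-one label mapping; everything here is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Reprogram(nn.Module):
    """Reuse a frozen acoustic classifier for time-series classification
    by learning only an additive perturbation and a label remapping."""

    def __init__(self, acoustic_model, audio_len=16000, series_len=500,
                 n_source_classes=35, n_target_classes=5):
        super().__init__()
        self.acoustic_model = acoustic_model.eval()
        for p in self.acoustic_model.parameters():
            p.requires_grad_(False)                 # backbone stays frozen
        self.delta = nn.Parameter(torch.zeros(audio_len))  # trainable noise
        mask = torch.zeros(audio_len)
        mask[:series_len] = 1.0                     # series fills these slots
        self.register_buffer("mask", mask)
        self.series_len = series_len
        self.head = nn.Linear(n_source_classes, n_target_classes, bias=False)

    def forward(self, x):                           # x: (batch, series_len)
        padded = torch.zeros(x.size(0), self.mask.numel(), device=x.device)
        padded[:, : self.series_len] = x
        audio_like = padded + (1 - self.mask) * self.delta  # perturb padding only
        return self.head(self.acoustic_model(audio_like))   # remap source logits

# Stand-in for a pretrained 35-class speech-command classifier.
backbone = nn.Sequential(nn.Linear(16000, 35))
model = Reprogram(backbone)
print(model(torch.randn(8, 500)).shape)  # torch.Size([8, 5])
```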
- One Model to Pronounce Them All: Multilingual Grapheme-to-Phoneme Conversion With a Transformer Ensemble [0.0]
We describe a simple approach of exploiting model ensembles, based on multilingual Transformers and self-training, to develop a highly effective G2P solution for 15 languages.
Our best models achieve 14.99 word error rate (WER) and 3.30 phoneme error rate (PER), a sizeable improvement over the shared task competitive baselines.
arXiv Detail & Related papers (2020-06-23T21:28:28Z)
- Transformer based Grapheme-to-Phoneme Conversion [0.9023847175654603]
In this paper, we investigate the application of transformer architecture to G2P conversion.
We compare its performance with recurrent and convolutional neural network based approaches.
The results show that Transformer-based G2P outperforms the convolution-based approach in terms of word error rate.
arXiv Detail & Related papers (2020-04-14T07:48:15Z)
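For concreteness, a character-to-phoneme seq2seq model in the spirit of this last entry can be built directly on PyTorch's nn.Transformer. The vocabulary sizes, depths, and dimensions below are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TransformerG2P(nn.Module):
    """Minimal grapheme-to-phoneme seq2seq model on nn.Transformer."""

    def __init__(self, n_graphemes=30, n_phonemes=45, d_model=128):
        super().__init__()
        self.src_emb = nn.Embedding(n_graphemes, d_model)
        self.tgt_emb = nn.Embedding(n_phonemes, d_model)
        self.pos = nn.Embedding(512, d_model)   # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=3,
            num_decoder_layers=3, dim_feedforward=256, batch_first=True,
        )
        self.out = nn.Linear(d_model, n_phonemes)

    def _embed(self, emb, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return emb(tokens) + self.pos(positions)

    def forward(self, graphemes, phonemes):
        # Causal mask: each phoneme position attends only to earlier ones.
        tgt_mask = self.transformer.generate_square_subsequent_mask(phonemes.size(1))
        h = self.transformer(
            self._embed(self.src_emb, graphemes),
            self._embed(self.tgt_emb, phonemes),
            tgt_mask=tgt_mask,
        )
        return self.out(h)  # (batch, tgt_len, n_phonemes) logits

model = TransformerG2P()
logits = model(torch.randint(0, 30, (2, 10)), torch.randint(0, 45, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 45])
```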