Code-Mixed Text to Speech Synthesis under Low-Resource Constraints
- URL: http://arxiv.org/abs/2312.01103v1
- Date: Sat, 2 Dec 2023 10:40:38 GMT
- Title: Code-Mixed Text to Speech Synthesis under Low-Resource Constraints
- Authors: Raviraj Joshi, Nikesh Garera
- Abstract summary: We describe our approaches for production quality code-mixed Hindi-English TTS systems built for e-commerce applications.
We propose a data-oriented approach by utilizing monolingual data sets in individual languages.
We show that such single script bi-lingual training without any code-mixing works well for pure code-mixed test sets.
- Score: 6.544954579068865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-speech (TTS) systems are an important component in voice-based
e-commerce applications. These applications include end-to-end voice assistants
and customer experience (CX) voice bots. Code-mixed TTS is also relevant in
these applications since the product names are commonly described in English
while the surrounding text is in a regional language. In this work, we describe
our approaches for production quality code-mixed Hindi-English TTS systems
built for e-commerce applications. We propose a data-oriented approach by
utilizing monolingual data sets in individual languages. We leverage a
transliteration model to convert the Roman text into a common Devanagari script
and then combine both datasets for training. We show that such single script
bi-lingual training without any code-mixing works well for pure code-mixed test
sets. We further present an exhaustive evaluation of single-speaker adaptation
and multi-speaker training with Tacotron2 + Waveglow setup to show that the
former approach works better. These approaches are also coupled with transfer
learning and decoder-only fine-tuning to improve performance. We compare these
approaches with the Google TTS and report a positive CMOS score of 0.02 with
the proposed transfer learning approach. We also perform low-resource voice
adaptation experiments to show that a new voice can be onboarded with just 3
hrs of data. This highlights the importance of our pre-trained models in
resource-constrained settings. This subjective evaluation is performed on a
large number of out-of-domain pure code-mixed sentences to demonstrate the high
quality of the systems.
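The data-oriented pipeline described above can be sketched in a few lines: transliterate every Roman-script token into Devanagari, then pool the monolingual corpora into one single-script training set. This is a minimal, hypothetical sketch; the tiny Roman-to-Devanagari lookup and the helper names (`transliterate`, `to_single_script`, `build_training_set`) are illustrative stand-ins for the learned transliteration model the paper actually uses.

```python
# Hypothetical sketch of single-script data preparation for bilingual TTS
# training. A real system would use a trained transliteration model instead
# of this toy lookup table.
ROMAN_TO_DEVANAGARI = {
    "mobile": "मोबाइल",
    "cover": "कवर",
    "order": "ऑर्डर",
}

def transliterate(token: str) -> str:
    """Map a Roman-script token to Devanagari (toy lookup; unknown tokens pass through)."""
    return ROMAN_TO_DEVANAGARI.get(token.lower(), token)

def to_single_script(text: str) -> str:
    """Convert each Roman-script word in a sentence to Devanagari."""
    return " ".join(
        transliterate(tok) if tok.isascii() and tok.isalpha() else tok
        for tok in text.split()
    )

def build_training_set(hindi_corpus: list[str], english_corpus: list[str]) -> list[str]:
    """Combine monolingual corpora into one Devanagari-only training set."""
    combined = list(hindi_corpus)  # Hindi sentences are already in Devanagari
    combined += [to_single_script(sentence) for sentence in english_corpus]
    return combined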
Related papers
- Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning [6.544954579068865]
We propose a transfer learning approach using high-resource language data and synthetically generated data.
We employ a three-step approach to train a high-quality single-speaker TTS system in a low-resource Indian language Hindi.
arXiv Detail & Related papers (2023-12-02T10:52:00Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data, and then investigate injecting the generated text into the T-T model either explicitly through Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Low-Resource Multilingual and Zero-Shot Multispeaker TTS [25.707717591185386]
We show that it is possible for a system to learn to speak a new language using just 5 minutes of training data.
We show the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker.
arXiv Detail & Related papers (2022-10-21T20:03:37Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and the corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to effectively share information across languages and according to a subjective evaluation test, it produces more natural and accurate code-switching speech than the baselines.
arXiv Detail & Related papers (2020-08-03T10:43:30Z)
- IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection [1.2301855531996841]
Code-mixing adds to the challenge of analyzing the sentiment of the text due to the non-standard writing style.
We present a candidate sentence generation and selection approach built on top of a Bi-LSTM based neural classifier.
The proposed approach shows an improvement in the system performance as compared to the Bi-LSTM based neural classifier.
arXiv Detail & Related papers (2020-06-25T14:59:47Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.