Generating Synthetic Speech from SpokenVocab for Speech Translation
- URL: http://arxiv.org/abs/2210.08174v1
- Date: Sat, 15 Oct 2022 03:07:44 GMT
- Title: Generating Synthetic Speech from SpokenVocab for Speech Translation
- Authors: Jinming Zhao, Gholamreza Haffari, Ehsan Shareghi
- Abstract summary: Training end-to-end speech translation systems requires sufficiently large-scale data.
One practical solution is to convert machine translation (MT) data to ST data via text-to-speech (TTS) systems.
We propose a simple, scalable and effective data augmentation technique, i.e., SpokenVocab, to convert MT data to ST data on-the-fly.
- Score: 18.525896864903416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training end-to-end speech translation (ST) systems requires sufficiently
large-scale data, which is unavailable for most language pairs and domains. One
practical solution to the data scarcity issue is to convert machine translation
(MT) data to ST data via text-to-speech (TTS) systems. Yet, using TTS systems
can be tedious and slow, as the conversion needs to be done for each MT
dataset. In this work, we propose a simple, scalable and effective data
augmentation technique, i.e., SpokenVocab, to convert MT data to ST data
on-the-fly. The idea is to retrieve and stitch audio snippets from a
SpokenVocab bank according to words in an MT sequence. Our experiments on
multiple language pairs from MuST-C show that this method outperforms strong
baselines by an average of 1.83 BLEU points, and that it performs as well as
TTS-generated speech. We also showcase how SpokenVocab can be applied to
code-switching ST, for which TTS systems often do not exist. Our code is available at
https://github.com/mingzi151/SpokenVocab
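To make the stitching idea concrete, here is a minimal sketch (ours, not the released SpokenVocab code) of how an MT source sentence could be converted to speech on-the-fly by looking up per-word audio snippets in a bank and concatenating them. The bank layout, the function name, and the inserted inter-word pause are illustrative assumptions.

```python
import numpy as np

def stitch_spoken_vocab(sentence: str,
                        spoken_vocab: dict[str, np.ndarray],
                        sample_rate: int = 16000,
                        pause_ms: int = 50) -> np.ndarray:
    """Build a waveform by stitching per-word snippets from a SpokenVocab-style bank."""
    pause = np.zeros(int(sample_rate * pause_ms / 1000), dtype=np.float32)
    pieces = []
    for word in sentence.lower().split():
        snippet = spoken_vocab.get(word)
        if snippet is None:
            # Out-of-vocabulary word: skip it here (one could back off to a TTS call).
            continue
        pieces.append(snippet.astype(np.float32))
        pieces.append(pause)  # short silence between words
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)

# Usage sketch: spoken_vocab maps each word to a 16 kHz mono float32 waveform.
# wav = stitch_spoken_vocab("hello world", spoken_vocab)
```

Because the lookup and concatenation are cheap, the same MT sentence can be re-voiced during training without pre-generating and storing a TTS copy of the whole dataset.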
Related papers
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
- Selective Data Augmentation for Robust Speech Translation [17.56859840101276]
We propose an end-to-end (e2e) architecture for English-Hindi (en-hi) ST.
We use two imperfect machine translation (MT) services to translate Libri-trans English text into Hindi text.
We show that this results in better ST (BLEU) score compared to brute force augmentation of MT data.
arXiv Detail & Related papers (2023-03-22T19:36:07Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
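For intuition, the following is a hedged sketch of a shared discrete codebook of the kind this summary describes: encoder states from either modality are snapped to their nearest codebook entry, so speech and text end up expressed over the same discrete vocabulary. The class, dimensions, and quantization details are illustrative assumptions, not the DCMA implementation.

```python
import torch
import torch.nn as nn

class SharedCodebookQuantizer(nn.Module):
    """Quantize encoder states (speech or text) into one shared discrete vocabulary."""
    def __init__(self, codebook_size: int = 1024, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, length, dim) hidden states from either the speech or text encoder
        flat = h.reshape(-1, h.size(-1))                 # (batch*length, dim)
        dist = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = dist.argmin(dim=-1).reshape(h.shape[:-1])  # nearest code per position
        q = self.codebook(idx)                           # quantized representation
        # Straight-through estimator: forward pass uses q, gradients flow back to h.
        return h + (q - h).detach(), idx
```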
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
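As a rough illustration of this adaptation recipe (a sketch under assumed module names, not the AdaSpeech 2 code): the pre-trained TTS model is frozen, untranscribed mel-spectrograms are fed through the added mel encoder, and only the decoder is updated with a reconstruction loss.

```python
import torch

def adapt_with_untranscribed_speech(tts, mel_encoder, mels, steps=1000, lr=1e-4):
    """Fine-tune only the TTS decoder via mel reconstruction (no transcripts needed)."""
    for p in tts.parameters():          # freeze the whole pre-trained TTS model
        p.requires_grad = False
    for p in tts.decoder.parameters():  # ...then unfreeze just its decoder
        p.requires_grad = True
    optim = torch.optim.Adam(tts.decoder.parameters(), lr=lr)
    for step in range(steps):
        mel = mels[step % len(mels)]              # one untranscribed utterance
        hidden = mel_encoder(mel)                 # stands in for the text-encoder path
        mel_hat = tts.decoder(hidden)             # reconstruct the input mel
        loss = torch.nn.functional.l1_loss(mel_hat, mel)
        optim.zero_grad()
        loss.backward()
        optim.step()
```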
arXiv Detail & Related papers (2021-04-20T01:53:30Z)