Zero-shot Speech Translation
- URL: http://arxiv.org/abs/2107.06010v1
- Date: Tue, 13 Jul 2021 12:00:44 GMT
- Title: Zero-shot Speech Translation
- Authors: Tu Anh Dinh
- Abstract summary: Speech Translation (ST) is the task of translating speech in one language into text in another language.
End-to-end approaches use a single system to avoid error propagation, yet are difficult to employ due to data scarcity.
We explore zero-shot translation, which enables translation between a language pair that is unseen during training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech Translation (ST) is the task of translating speech in one language
into text in another language. Traditional cascaded approaches for ST, using
Automatic Speech Recognition (ASR) and Machine Translation (MT) systems, are
prone to error propagation. End-to-end approaches use a single system to avoid
error propagation, yet are difficult to employ due to data scarcity. We explore
zero-shot translation, which enables translation between a language pair that is
unseen during training, thus avoiding the need for end-to-end ST data. Zero-shot
translation has been shown to work for multilingual machine translation, yet
has not been studied for speech translation. We attempt to build zero-shot ST
models that are trained only on ASR and MT tasks but can perform the ST task at
inference time. The challenge is that the representations of text and audio
differ significantly, so the models learn the ASR and MT tasks in different
ways, making zero-shot transfer non-trivial. These models tend to output
the wrong language when performing zero-shot ST. We tackle the issues by
including additional training data and an auxiliary loss function that
minimizes the text-audio difference. Our experimental results and analysis show
that these methods are promising for zero-shot ST. Moreover, they are
particularly useful in the few-shot settings where a limited amount of ST data
is available, with improvements of up to +11.8 BLEU points compared to direct
end-to-end ST models and +3.9 BLEU points compared to ST models fine-tuned from
a pre-trained ASR model.
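The core mechanism is easiest to see in code. Below is a minimal sketch of the auxiliary alignment loss, assuming mean-pooled encoder states and an MSE distance; the pooling, distance, and the 0.5 weight are illustrative assumptions, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def alignment_loss(audio_states, text_states):
    # Pull the audio and text encodings of the same utterance together.
    # Mean-pooling over time removes the length mismatch between the
    # two modalities; MSE is one possible distance (an assumption here).
    audio_vec = audio_states.mean(dim=1)  # (batch, hidden)
    text_vec = text_states.mean(dim=1)    # (batch, hidden)
    return F.mse_loss(audio_vec, text_vec)

# Toy example: paired ASR data gives audio and its transcript, so both
# encoders see the same utterance (random tensors stand in here).
audio_states = torch.randn(2, 50, 8)  # speech encoder output
text_states = torch.randn(2, 12, 8)   # text encoder output

# Hypothetical per-task losses from the ASR and MT objectives.
asr_loss, mt_loss = torch.tensor(2.3), torch.tensor(1.7)
total = asr_loss + mt_loss + 0.5 * alignment_loss(audio_states, text_states)
```

With the modality gap reduced this way, the MT decoder can consume speech-encoder states at inference time even though it only saw text during training.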
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
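A hedged sketch of the augmentation loop this summary describes; rewrite_with_llm and zero_shot_tts are hypothetical placeholders for the LLM and TTS components, not the paper's actual interfaces.

```python
import random

def rewrite_with_llm(sentence: str) -> list:
    # Hypothetical stand-in for an LLM prompted to rewrite an
    # in-domain sentence into diverse variants (paraphrases etc.).
    return [sentence, sentence + " please"]  # dummy variants

def zero_shot_tts(text: str, speaker_prompt: str) -> bytes:
    # Hypothetical stand-in for a zero-shot TTS model that mimics
    # the voice in `speaker_prompt` while reading `text`.
    return b"\x00"  # dummy audio

def synthesize_asr_data(in_domain_texts, speaker_prompts):
    # Produce (audio, transcript) pairs for ASR augmentation:
    # diverse text from the LLM, diverse voices from zero-shot TTS.
    pairs = []
    for text in in_domain_texts:
        for variant in rewrite_with_llm(text):
            voice = random.choice(speaker_prompts)
            pairs.append((zero_shot_tts(variant, voice), variant))
    return pairs
```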
arXiv Detail & Related papers (2024-11-20T09:49:37Z)
- Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
- End-to-End Speech-to-Text Translation: A Survey [0.0]
Speech-to-text translation is the task of converting speech signals in one language into text in another language.
Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional cascaded ST.
arXiv Detail & Related papers (2023-12-02T07:40:32Z)
- Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation [79.96416609433724]
Zero-shot translation (ZST) aims to translate between unseen language pairs in training data.
The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs.
Recent studies have shown that language IDs sometimes fail to steer the ZST task, leaving models prone to the off-target problem, i.e. output in the wrong language.
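A minimal sketch of an unlikelihood term on negative samples, here taken to be tokens from the off-target language; which tokens count as negative, and everything else beyond the abstract, is an assumption.

```python
import torch

def unlikelihood_loss(log_probs, negative_tokens):
    # Penalize probability mass placed on off-target tokens:
    # loss = -log(1 - p(token)) for each negative token.
    # log_probs: (batch, seq, vocab) log-softmax output.
    # negative_tokens: (batch, seq) ids of wrong-language tokens.
    p_neg = log_probs.gather(-1, negative_tokens.unsqueeze(-1))
    p_neg = p_neg.squeeze(-1).exp().clamp(max=1 - 1e-6)
    return -torch.log1p(-p_neg).mean()

log_probs = torch.log_softmax(torch.randn(2, 5, 100), dim=-1)
negatives = torch.randint(0, 100, (2, 5))  # off-target token ids
loss = unlikelihood_loss(log_probs, negatives)
```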
arXiv Detail & Related papers (2023-09-28T17:02:36Z)
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
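Such self-supervised discrete units are commonly obtained by clustering frame-level speech representations; a minimal sketch with k-means follows (the paper's concrete unit extractor may differ).

```python
import numpy as np
from sklearn.cluster import KMeans

# Frame-level features from a self-supervised speech encoder;
# random data stands in for real features here.
features = np.random.randn(1000, 768).astype(np.float32)

# Learn a unit inventory by clustering, then map frames to unit ids.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)  # (1000,) discrete unit ids

# Collapsing consecutive duplicates shortens the sequence, which is
# one way discrete units ease the short-to-long generation problem.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```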
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
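A minimal sketch of the shared-codebook idea: both encoders snap their hidden states onto one discrete vocabulary via nearest-neighbor vector quantization. The straight-through gradient trick and all sizes are assumptions beyond this summary.

```python
import torch

codebook = torch.randn(512, 256)  # shared discrete vocabulary

def quantize(x):
    # Map each hidden vector to its nearest codebook entry.
    # x: (batch, seq, 256) from either the speech or text encoder.
    b, t, d = x.shape
    codes = torch.cdist(x.reshape(-1, d), codebook).argmin(-1)
    quantized = codebook[codes].reshape(b, t, d)
    # Straight-through estimator keeps the encoder trainable.
    return x + (quantized - x).detach(), codes.reshape(b, t)

speech_q, speech_ids = quantize(torch.randn(2, 50, 256))
text_q, text_ids = quantize(torch.randn(2, 12, 256))
# Both modalities now live in the same discrete space, so a decoder
# trained on quantized text can consume quantized speech zero-shot.
```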
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Large-Scale Streaming End-to-End Speech Translation with Neural Transducers [35.2855796745394]
We introduce a streaming end-to-end speech translation (ST) model to convert audio signals to texts in other languages directly.
Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency.
We extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time.
arXiv Detail & Related papers (2022-04-11T18:18:53Z)
- Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques [12.968557512440759]
Several techniques have been proposed for zero-shot translation.
We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data.
The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from an ASR model.
arXiv Detail & Related papers (2022-01-26T20:20:59Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
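A minimal sketch of a generative LSTM language model over such sub-word linguistic units (phoneme-like ids; the inventory size and dimensions are illustrative).

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    # Next-unit prediction over a small phoneme/syllable inventory.
    def __init__(self, n_units=64, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, n_units)

    def forward(self, units):
        h, _ = self.lstm(self.embed(units))
        return self.out(h)  # (batch, seq, n_units) logits

model = UnitLM()
units = torch.randint(0, 64, (4, 20))  # batch of unit-id sequences
logits = model(units[:, :-1])          # predict the next unit
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 64), units[:, 1:].reshape(-1))
```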
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis [87.75833205560406]
This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system.
It does not require pooling data from all languages, and thus alleviates the storage and computation burden.
arXiv Detail & Related papers (2021-10-09T07:00:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.