Tight Integrated End-to-End Training for Cascaded Speech Translation
- URL: http://arxiv.org/abs/2011.12167v1
- Date: Tue, 24 Nov 2020 15:43:49 GMT
- Title: Tight Integrated End-to-End Training for Cascaded Speech Translation
- Authors: Parnia Bahar, Tobias Bieschke, Ralf Schlüter and Hermann Ney
- Abstract summary: A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
- Score: 40.76367623739673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A cascaded speech translation model relies on discrete and
non-differentiable transcription, which provides a supervision signal from the
source side and eases the transformation between source speech and target text. Such modeling
suffers from error propagation between the ASR and MT models. Direct speech
translation is an alternative that avoids error propagation; however, its
performance often lags behind that of the cascade system. To use an intermediate
representation and preserve the end-to-end trainability, previous studies have
proposed using two-stage models by passing the hidden vectors of the recognizer
into the decoder of the MT model and ignoring the MT encoder. This work
explores the feasibility of collapsing the entire cascade components into a
single end-to-end trainable model by optimizing all parameters of ASR and MT
models jointly without ignoring any learned parameters. The method is tightly
integrated: it passes renormalized source word posterior distributions to the MT
encoder as soft decisions instead of one-hot vectors, which enables backpropagation.
Therefore, it provides both transcriptions and translations and achieves strong
consistency between them. Our experiments on four tasks with different data
scenarios show that the model outperforms cascade models by up to 1.8% BLEU and
2.0% TER and is superior to direct models.
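To make the coupling concrete, here is a minimal PyTorch sketch of the
soft-decision step described above; the function name, tensor shapes, and the
sharpening exponent gamma are illustrative assumptions, not the authors' exact
implementation.

```python
import torch.nn.functional as F

def soft_source_embeddings(asr_logits, mt_src_embedding, gamma=2.0):
    """Differentiable ASR-to-MT coupling via renormalized posteriors (sketch).

    asr_logits:       (batch, src_len, vocab) unnormalized ASR decoder scores
    mt_src_embedding: the MT encoder's nn.Embedding, weight shape (vocab, dim)
    gamma:            sharpening exponent; as gamma grows, the soft decision
                      approaches the hard one-hot choice of a classic cascade
    """
    # softmax(gamma * logits) is the same as renormalizing p**gamma for
    # p = softmax(logits): a sharpened but still differentiable decision.
    posteriors = F.softmax(gamma * asr_logits, dim=-1)   # (B, T, V)

    # Expected source embedding under the posterior. Gradients from the MT
    # loss now flow back through the posteriors into the ASR parameters.
    return posteriors @ mt_src_embedding.weight          # (B, T, dim)
```

Feeding these expected embeddings to the MT encoder in place of a one-hot
embedding lookup is what keeps the whole cascade trainable end to end.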
Related papers
- Coupling Speech Encoders with Downstream Text Models [4.679869237248675]
We present a modular approach to building cascade speech translation models.
We preserve state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task.
arXiv Detail & Related papers (2024-07-24T19:29:13Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
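As a rough illustration of what continuous-valued autoregressive generation
means in practice, here is a toy PyTorch decoder whose regression head emits
mel frames directly; the layer sizes and names are assumptions, not MELLE's
actual architecture.

```python
import torch
import torch.nn as nn

class ContinuousARDecoder(nn.Module):
    """Toy autoregressive decoder over continuous mel frames (no VQ).

    A regression head replaces the usual softmax over discrete codes, so
    each step emits a real-valued mel-spectrogram frame directly.
    """
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)  # regression, not classification

    def forward(self, prev_frames, text_memory):
        # prev_frames: (B, T, n_mels); text_memory: (B, S, d_model) text encoding
        x = self.in_proj(prev_frames)
        t = x.size(1)
        # Standard causal mask: -inf strictly above the diagonal.
        causal = torch.triu(torch.full((t, t), float("-inf"), device=x.device),
                            diagonal=1)
        h = self.decoder(x, text_memory, tgt_mask=causal)
        return self.out_proj(h)  # next-frame predictions, trained with a regression loss
```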
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between speech and text are two major obstacles for end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
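A minimal sketch of what hard parameter sharing looks like, assuming (as the
summary suggests) that a pre-processing stage has mapped speech into discrete
units in the same vocabulary as text; class and field names are illustrative.

```python
import torch.nn as nn

class SharedSTMTModel(nn.Module):
    """Hard-parameter-sharing sketch: one encoder-decoder serves ST and MT."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # The forward pass is identical whether src_tokens came from
        # discretized speech (an ST batch) or source text (an MT batch):
        # that reuse of every parameter is the "hard" in hard sharing.
        h = self.transformer(self.embed(src_tokens), self.embed(tgt_tokens))
        return self.proj(h)

# Training mixes ST and MT batches and sums the two cross-entropy losses,
# so every update touches the shared parameters.
```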
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation [40.62692548291319]
Text image machine translation (TIMT) aims to translate texts embedded in images from one source language to another target language.
Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues.
We propose an end-to-end TIMT model that makes full use of the knowledge in existing OCR and MT datasets.
arXiv Detail & Related papers (2023-05-09T04:25:52Z)
- Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models.
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
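Sequence-level knowledge distillation itself reduces to a data-preparation
step: the student trains on the teacher's decoded outputs instead of the
references. A hedged sketch, where `teacher_nmt.translate` and the dataset
fields are assumed interfaces:

```python
def build_seqkd_targets(st_dataset, teacher_nmt, beam_size=5):
    """SeqKD sketch: replace gold translations with teacher hypotheses."""
    distilled = []
    for ex in st_dataset:
        # The text-based NMT teacher decodes the source transcript, not audio.
        hyp = teacher_nmt.translate(ex["transcript"], beam=beam_size)
        distilled.append({"audio": ex["audio"],  # student input stays speech
                          "target": hyp})        # student target: teacher output
    return distilled

# The E2E-ST student then trains on `distilled` with its usual cross-entropy
# loss; the teacher's smoother output distribution is easier to fit.
```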
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
- Streaming Models for Joint Speech Recognition and Translation [11.657994715914748]
We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches.
We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders.
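One way to picture the joint generation is a single target stream that
alternates tagged transcript and translation tokens; the word-level
alternation and the tag tokens below are assumptions, not necessarily the
paper's exact interleaving scheme.

```python
def interleave_targets(transcript, translation):
    """Build one target stream carrying both outputs for a single decoder."""
    joint = []
    for i in range(max(len(transcript), len(translation))):
        if i < len(transcript):
            joint += ["<asr>", transcript[i]]
        if i < len(translation):
            joint += ["<st>", translation[i]]
    return joint

# interleave_targets(["guten", "Tag"], ["good", "day"])
# -> ['<asr>', 'guten', '<st>', 'good', '<asr>', 'Tag', '<st>', 'day']
# At inference, generated tokens are routed to the transcript or translation
# stream by their tag, so no second decoder is needed.
```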
arXiv Detail & Related papers (2021-01-22T15:16:54Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
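Read as a pipeline, the two components compose as follows; `translate` and
`decode` are assumed interfaces for illustration, not the released
implementation.

```python
def synthesize(text, transformer_nmt, vqvae):
    """TTS cast as translation, composing the two components named above."""
    # 1) The autoregressive Transformer-NMT model "translates" the input
    #    text into a sequence of discrete VQ-VAE codebook indices.
    codes = transformer_nmt.translate(text)  # e.g. [412, 7, 983, ...]
    # 2) The non-autoregressive VQ-VAE decoder maps those indices back to
    #    speech in one shot.
    return vqvae.decode(codes)

# Training is two-stage: fit the VQ-VAE on speech alone to obtain a discrete
# "target language", then train the NMT model on (text, code-sequence) pairs
# extracted by the VQ-VAE encoder.
```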
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.