Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard
Parameter Sharing
- URL: http://arxiv.org/abs/2309.15826v1
- Date: Wed, 27 Sep 2023 17:48:14 GMT
- Title: Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard
Parameter Sharing
- Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji
Watanabe
- Abstract summary: We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
- Score: 72.56219471145232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed
multi-tasking methods with soft parameter sharing which leverage machine
translation (MT) data via secondary encoders that map text inputs to an
eventual cross-modal representation. In this work, we instead propose a ST/MT
multi-tasking framework with hard parameter sharing in which all model
parameters are shared cross-modally. Our method reduces the speech-text
modality gap via a pre-processing stage which converts speech and text inputs
into two discrete token sequences of similar length -- this allows models to
indiscriminately process both modalities simply using a joint vocabulary. With
experiments on MuST-C, we demonstrate that our multi-tasking framework improves
attentional encoder-decoder, Connectionist Temporal Classification (CTC),
transducer, and joint CTC/attention models by an average of +0.5 BLEU without
any external MT data. Further, we show that this framework incorporates
external MT data, yielding +0.8 BLEU, and also improves transfer learning from
pre-trained textual models, yielding +1.8 BLEU.
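The recipe in the abstract can be pictured with a short sketch: speech is pre-processed into discrete units and text into subwords, both are mapped into one joint vocabulary, and a single encoder-decoder is updated by alternating ST and MT batches so that every parameter is shared across modalities. Everything below (unit counts, vocabulary sizes, model dimensions, the toy training loop) is an illustrative assumption, not the authors' configuration.

    # Minimal sketch of ST/MT multi-tasking with hard parameter sharing. It assumes
    # speech has already been discretized (e.g. k-means IDs over self-supervised
    # features) and text has been subword-tokenized; details are illustrative only.
    import torch
    import torch.nn as nn

    SPEECH_UNITS = 500     # assumed number of discrete speech units
    TEXT_SUBWORDS = 8000   # assumed subword vocabulary size
    JOINT_VOCAB = SPEECH_UNITS + TEXT_SUBWORDS  # one joint input vocabulary

    class SharedSeq2Seq(nn.Module):
        """Every parameter is shared between the ST and MT tasks."""
        def __init__(self, d_model=256):
            super().__init__()
            self.src_emb = nn.Embedding(JOINT_VOCAB, d_model)
            self.tgt_emb = nn.Embedding(TEXT_SUBWORDS, d_model)
            self.transformer = nn.Transformer(d_model, nhead=4, num_encoder_layers=2,
                                              num_decoder_layers=2, batch_first=True)
            self.out = nn.Linear(d_model, TEXT_SUBWORDS)

        def forward(self, src_tokens, tgt_tokens):
            h = self.transformer(self.src_emb(src_tokens), self.tgt_emb(tgt_tokens))
            return self.out(h)

    def to_joint_ids(tokens, modality):
        # Speech units occupy [0, SPEECH_UNITS); text subwords are offset past them,
        # so one model processes both modalities through the same embedding table.
        return tokens if modality == "speech" else tokens + SPEECH_UNITS

    model, loss_fn = SharedSeq2Seq(), nn.CrossEntropyLoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Alternate ST batches (discrete speech units -> target text) and MT batches
    # (source subwords -> target text); both update the same parameters.
    for modality in ["speech", "text"]:
        vocab = SPEECH_UNITS if modality == "speech" else TEXT_SUBWORDS
        src = to_joint_ids(torch.randint(0, vocab, (2, 20)), modality)
        tgt = torch.randint(0, TEXT_SUBWORDS, (2, 12))
        logits = model(src, tgt[:, :-1])
        loss = loss_fn(logits.reshape(-1, TEXT_SUBWORDS), tgt[:, 1:].reshape(-1))
        loss.backward()
        opt.step()
        opt.zero_grad()

In the paper's setup, the similar lengths of the two discrete sequences are what let the shared model treat them indiscriminately; the token offsets above are just one simple way to realize a joint vocabulary.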
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT outperforms single model counterparts consistently.
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the gap between the speech and text modalities are two major obstacles for end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as input the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input the sentences of the textual modality.
We propose Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
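Since the FLIP summary describes alignment at the level of individual features rather than whole instances, a rough illustration can be sketched. The snippet below shows one plausible form of field-level alignment (a contrastive loss between projected ID embeddings and per-field text features); it is an assumption for illustration, not necessarily FLIP's actual objective, and all sizes are made up.

    # Hypothetical field-level alignment between an ID-based CTR model and a text
    # encoder: each field's ID embedding is pulled toward the text feature that
    # describes the same field within the same row.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_IDS, NUM_FIELDS, D_ID, D_TXT, D_SHARED = 10000, 8, 32, 768, 128

    id_emb = nn.Embedding(NUM_IDS, D_ID)   # tabular (ID) modality
    proj_id = nn.Linear(D_ID, D_SHARED)    # project ID features into a shared space
    proj_txt = nn.Linear(D_TXT, D_SHARED)  # project text features into the same space

    ids = torch.randint(0, NUM_IDS, (4, NUM_FIELDS))  # a batch of tabular rows
    txt_feats = torch.randn(4, NUM_FIELDS, D_TXT)     # per-field PLM features (assumed given)

    z_id = F.normalize(proj_id(id_emb(ids)), dim=-1)
    z_txt = F.normalize(proj_txt(txt_feats), dim=-1)

    # Each ID feature should match the textual description of its own field rather
    # than the other fields in the row (an InfoNCE-style objective).
    logits = torch.einsum("bfd,bgd->bfg", z_id, z_txt) / 0.07
    targets = torch.arange(NUM_FIELDS).expand(4, NUM_FIELDS)
    align_loss = F.cross_entropy(logits.reshape(-1, NUM_FIELDS), targets.reshape(-1))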
- Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs [45.41083125321069]
Multimodal machine translation (MMT) systems exhibit decreased sensitivity to visual information when text inputs are complete.
A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text.
An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process.
arXiv Detail & Related papers (2023-10-26T04:13:49Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on a T-T model trained with 1,800 hours of real Mandarin-English code-switched speech show that injecting generated code-switching text significantly boosts the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z)
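Collapsing a cascade into one trainable model hinges on making the ASR-to-MT interface differentiable. Below is a minimal sketch of one standard device for this, feeding the MT source side with the expectation of its embeddings under the ASR posterior instead of a discrete transcript; the paper's exact coupling may differ, and the tensors here are random stand-ins.

    # A hard (argmax) interface discards gradients; a soft interface built from the
    # ASR posterior keeps the whole pipeline end-to-end trainable.
    import torch
    import torch.nn as nn

    VOCAB, D_MODEL = 1000, 256
    asr_logits = torch.randn(2, 30, VOCAB, requires_grad=True)  # stand-in for ASR decoder outputs
    mt_src_emb = nn.Embedding(VOCAB, D_MODEL)                   # MT source embedding table

    # Hard interface: argmax transcription breaks the gradient path to the ASR model.
    hard_in = mt_src_emb(asr_logits.argmax(dim=-1))

    # Soft interface: expected embedding under the posterior is differentiable.
    posterior = asr_logits.softmax(dim=-1)       # (batch, time, vocab)
    soft_in = posterior @ mt_src_emb.weight      # (batch, time, d_model)
    soft_in.sum().backward()
    print(asr_logits.grad is not None)           # True: gradients reach the ASR side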
- ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding [23.367329217151084]
We introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding (SLU) tasks.
Taking phoneme posteriors and subword-level text as input, ST-BERT learns a contextualized cross-modal alignment.
Our method shows further SLU performance gains via domain-adaptive pre-training with domain-specific speech-text paired data.
arXiv Detail & Related papers (2020-10-23T10:28:20Z)
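Because the ST-BERT entry names its inputs (phoneme posteriors plus subword text), the flavor of such cross-modal pre-training can be sketched. The code below uses a generic masked-prediction objective over the concatenated sequence; dimensions, masking scheme, and the objective itself are assumptions rather than ST-BERT's actual specification.

    # Hypothetical cross-modal pre-training sketch: phoneme-posterior frames and
    # subword tokens are embedded into one sequence, and a Transformer encoder is
    # trained to recover masked subwords with help from the speech side.
    import torch
    import torch.nn as nn

    N_PHONEMES, N_SUBWORDS, D = 70, 8000, 256
    MASK_ID = N_SUBWORDS - 1                   # reserve one subword id as [MASK]

    phone_proj = nn.Linear(N_PHONEMES, D)      # phoneme posteriors -> model space
    text_emb = nn.Embedding(N_SUBWORDS, D)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
    mlm_head = nn.Linear(D, N_SUBWORDS)

    phone_post = torch.rand(2, 50, N_PHONEMES).softmax(dim=-1)  # per-frame phoneme posteriors
    subwords = torch.randint(0, N_SUBWORDS, (2, 20))            # paired text

    masked = subwords.clone()
    mask = torch.rand(subwords.shape) < 0.15                    # mask ~15% of subwords
    masked[mask] = MASK_ID

    # Concatenating both modalities lets self-attention align speech and text.
    x = torch.cat([phone_proj(phone_post), text_emb(masked)], dim=1)
    h = encoder(x)[:, 50:]                                      # keep only the text positions
    mlm_loss = nn.functional.cross_entropy(mlm_head(h)[mask], subwords[mask])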