Understanding and Bridging the Modality Gap for Speech Translation
- URL: http://arxiv.org/abs/2305.08706v1
- Date: Mon, 15 May 2023 15:09:18 GMT
- Title: Understanding and Bridging the Modality Gap for Speech Translation
- Authors: Qingkai Fang, Yang Feng
- Abstract summary: Multi-task learning is one of the effective ways to share knowledge between machine translation (MT) and end-to-end speech translation (ST)
However, due to the differences between speech and text, there is always a gap between ST and MT.
In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias.
- Score: 11.13240570688547
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text)
machine translation (MT) data? Among various existing techniques, multi-task
learning is one effective way to share knowledge between ST and MT, in which
additional MT data can help to learn the source-to-target mapping. However,
due to the differences between speech and text, there is always a gap between
ST and MT. In this paper, we first aim to understand this modality gap from the
target-side representation differences, and link the modality gap to another
well-known problem in neural machine translation: exposure bias. We find that
the modality gap is relatively small during training except for some difficult
cases, but keeps increasing during inference due to the cascading effect. To
address these problems, we propose the Cross-modal Regularization with
Scheduled Sampling (Cress) method. Specifically, we regularize the output
predictions of ST and MT, whose target-side contexts are derived by sampling
between ground truth words and self-generated words with a varying probability.
Furthermore, we introduce token-level adaptive training which assigns different
training weights to target tokens to handle difficult cases with large modality
gaps. Experiments and analysis show that our approach effectively bridges the
modality gap, and achieves promising results in all eight directions of the
MuST-C dataset.
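The training step described above can be pictured with a short sketch. The following is a minimal, illustrative PyTorch version of scheduled sampling plus a cross-modal regularizer with token-level weights; the function names, the symmetric-KL instantiation, and the weighting scheme are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch of a Cress-style training step (assumptions noted above).
import torch
import torch.nn.functional as F

def scheduled_context(gold_tokens, model_tokens, p_sample):
    """Mix ground-truth and self-generated tokens per position.

    With probability `p_sample` a position keeps the model's own prediction,
    otherwise the gold token (classic scheduled sampling).
    """
    keep_model = torch.rand_like(gold_tokens, dtype=torch.float) < p_sample
    return torch.where(keep_model, model_tokens, gold_tokens)

def cress_step(st_logits, mt_logits, gold, token_weights):
    """Cross-modal regularization with token-level adaptive weights.

    st_logits, mt_logits: (batch, len, vocab) predictions of the ST and MT
    decoders, both conditioned on the *same* sampled target-side context
    (e.g., produced by `scheduled_context`).
    token_weights: (batch, len), larger for tokens with a large modality gap
    (one plausible form of the paper's token-level adaptive training).
    """
    # Per-token translation loss, reweighted token by token.
    ce = F.cross_entropy(st_logits.transpose(1, 2), gold, reduction="none")
    ce = (token_weights * ce).mean()
    # Symmetric KL between ST and MT output distributions (the cross-modal
    # regularization term; one plausible instantiation).
    p_st = F.log_softmax(st_logits, dim=-1)
    p_mt = F.log_softmax(mt_logits, dim=-1)
    reg = 0.5 * (F.kl_div(p_st, p_mt, log_target=True, reduction="batchmean")
                 + F.kl_div(p_mt, p_st, log_target=True, reduction="batchmean"))
    return ce + reg
```

In training, `p_sample` would typically be annealed upward so that the target-side context increasingly resembles what the model sees at inference, which is how scheduled sampling addresses exposure bias.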
Related papers
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT consistently outperforms single-model counterparts.
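A rough sketch of the "different modalities as different languages" interface, assuming hypothetical vocabulary offsets for speech units, image codebook ids, and subwords; the concrete tokenizers and vocabulary layout are not taken from the paper.

```python
# Illustrative unified-token interface (offsets are assumptions).
SPEECH_OFFSET = 0      # speech unit ids occupy [0, 1000)
IMAGE_OFFSET = 1000    # image codebook ids occupy [1000, 9192)
TEXT_OFFSET = 9192     # subword ids occupy [9192, ...)

def to_unified_ids(tokens, offset):
    """Place modality-specific discrete tokens into one shared vocabulary."""
    return [t + offset for t in tokens]

def make_example(src_tokens, src_offset, tgt_tag):
    """A single seq2seq model can then treat speech->text, image->text, etc.
    as ordinary translation directions, prefixed with a target-'language' tag."""
    return [tgt_tag] + to_unified_ids(src_tokens, src_offset)
```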
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modality gap by mitigating the differences in length and representation.
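One common way to mitigate the length and representation mismatch mentioned here is to pool speech encoder states down to the text length and penalize the remaining representation distance. The sketch below is a generic illustration under that assumption, not IMTL's exact mechanism; `shrink_to_text_length` is a hypothetical helper.

```python
# Generic length/representation gap mitigation (illustrative assumptions).
import torch
import torch.nn.functional as F

def shrink_to_text_length(speech_states, text_len):
    """speech_states: (batch, T_speech, dim) -> (batch, text_len, dim)."""
    pooled = F.adaptive_avg_pool1d(speech_states.transpose(1, 2), text_len)
    return pooled.transpose(1, 2)

def representation_gap(speech_states, text_states):
    """L2 distance between modality representations after length matching."""
    matched = shrink_to_text_length(speech_states, text_states.size(1))
    return F.mse_loss(matched, text_states)
```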
arXiv Detail & Related papers (2023-11-07T08:48:46Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
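Hard parameter sharing means ST and MT batches update one and the same network, with only thin modality-specific front-ends. A minimal sketch, assuming an 80-dim filterbank speech input and a toy Transformer backbone (both assumptions, not the paper's configuration):

```python
# Minimal hard-parameter-sharing sketch: one shared backbone, two front-ends.
import torch.nn as nn

class SharedSTMT(nn.Module):
    def __init__(self, dim=512, vocab=10000):
        super().__init__()
        self.speech_frontend = nn.Linear(80, dim)   # e.g., fbank frames
        self.text_embed = nn.Embedding(vocab, dim)
        self.backbone = nn.Transformer(d_model=dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt_tokens, is_speech):
        # ST and MT differ only in the input front-end; everything after
        # this line is shared, so both tasks train the same parameters.
        x = self.speech_frontend(src) if is_speech else self.text_embed(src)
        y = self.text_embed(tgt_tokens)
        return self.out(self.backbone(x, y))
```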
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation [15.139447549817483]
End-to-end speech translation (ST) is a cross-modal task.
Existing methods often try to transfer knowledge from machine translation (MT).
We propose Cross-modal Mixup via Optimal Transport (CMOT) to overcome the modality gap.
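A sketch of the idea: compute an optimal-transport plan between speech and text states, read off an alignment, and interpolate aligned pairs. The generic Sinkhorn iteration and the hard argmax alignment are illustrative simplifications, not CMOT's exact algorithm.

```python
# Illustrative OT-guided cross-modal mixup for one example.
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropic OT between uniform marginals; returns the transport plan."""
    cost = cost / (cost.max() + 1e-9)           # scale for numeric stability
    K = torch.exp(-cost / eps)                  # (m, n) Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))
    v = torch.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def cross_modal_mixup(speech, text, lam=0.5):
    """speech: (T_sp, dim), text: (T_txt, dim) for one example."""
    cost = torch.cdist(speech, text)            # pairwise L2 costs
    plan = sinkhorn(cost)
    align = plan.argmax(dim=0)                  # best speech index per token
    return lam * speech[align] + (1 - lam) * text
```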
arXiv Detail & Related papers (2023-05-24T02:13:48Z)
- Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning [8.501945512734268]
We propose the FCCL (Fine- and Coarse- Granularity Contrastive Learning) approach for E2E-ST.
A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to provide comprehensive guidance for extracting speech representations that contain rich semantic information.
Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs.
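Both granularities can be expressed with the same InfoNCE loss, applied once to mean-pooled sentence vectors and once to frame-level vectors. The temperature, the pooling choice, and the equal-length assumption at the fine level are illustrative, not FCCL's exact setup.

```python
# InfoNCE applied at coarse (sentence) and fine (frame) granularity.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """a, b: (n, dim) paired views; matching rows are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

def fine_and_coarse_loss(speech_states, text_states):
    # Coarse: sentence-level, mean-pooled across time, contrasted in-batch.
    coarse = info_nce(speech_states.mean(dim=1), text_states.mean(dim=1))
    # Fine: frame/token-level within one pair (equal lengths assumed here;
    # real systems would align lengths first).
    fine = info_nce(speech_states[0], text_states[0])
    return coarse + fine
```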
arXiv Detail & Related papers (2023-04-20T13:41:56Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for cross-lingual sequence labeling (xSL), named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage the consistency between representations of parallel input sequences.
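The span-masking side can be sketched as below; the span length and masking ratio are placeholders, since the paper's exact masking scheme is not reproduced here.

```python
# Illustrative contiguous span masking for a CLISM-style pre-training input.
import random

def mask_spans(tokens, mask_id, span_len=3, ratio=0.15):
    """Replace contiguous spans with `mask_id` until ~`ratio` is masked."""
    out = list(tokens)
    budget = max(1, int(len(out) * ratio))
    masked = 0
    while masked < budget:
        start = random.randrange(len(out))
        for j in range(start, min(start + span_len, len(out))):
            if out[j] != mask_id:
                out[j] = mask_id
                masked += 1
    return out
```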
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the discrepancy between speech and text representations.
Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
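One plausible reading of manifold mixup between the two modalities, assuming aligned speech and text embedding sequences of the same length (the alignment step itself is omitted):

```python
# Illustrative per-position mixup between aligned speech/text embeddings.
import torch

def manifold_mixup(speech_emb, text_emb, p=0.5):
    """speech_emb, text_emb: (len, dim) aligned sequences.

    Each position takes the speech-side vector with probability `p`,
    otherwise the text-side vector, yielding a mixed-modality sequence.
    """
    take_speech = (torch.rand(speech_emb.size(0), 1) < p).float()
    return take_speech * speech_emb + (1 - take_speech) * text_emb
```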
arXiv Detail & Related papers (2022-03-20T01:49:53Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, but translates natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses pseudo-parallel data (natural source, translated target) to mimic the inference scenario.
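A sketch of one training step under this scheme; `model.translate` and its `direction` argument are hypothetical names for the two translation directions, not an API from the paper.

```python
# Illustrative online self-training step for UNMT.
def online_self_training_batch(model, natural_src, natural_tgt):
    # Back-translation pair: (translated source, natural target),
    # the standard UNMT training signal.
    bt_src = model.translate(natural_tgt, direction="tgt2src")
    # Self-training pair: (natural source, translated target),
    # generated on the fly to mimic the inference-time input distribution.
    st_tgt = model.translate(natural_src, direction="src2tgt")
    return [(bt_src, natural_tgt), (natural_src, st_tgt)]
```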
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both the representation level and the gradient level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
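At the gradient level, one concrete instantiation is to measure and encourage agreement between the gradients induced by different language directions; this helper is illustrative rather than the paper's exact formulation.

```python
# Illustrative gradient-agreement measure between two language directions.
import torch

def grad_cosine(loss_a, loss_b, params):
    """Cosine similarity between the gradients of two per-direction losses.

    params: e.g., list(model.parameters()); create_graph=True keeps the
    result differentiable so it can be added to the training objective.
    """
    g_a = torch.autograd.grad(loss_a, params, retain_graph=True, create_graph=True)
    g_b = torch.autograd.grad(loss_b, params, retain_graph=True, create_graph=True)
    flat_a = torch.cat([g.flatten() for g in g_a])
    flat_b = torch.cat([g.flatten() for g in g_b])
    return torch.nn.functional.cosine_similarity(flat_a, flat_b, dim=0)
```

Maximizing this similarity (e.g., subtracting a scaled `grad_cosine` term from the total loss) is one way to realize gradient-level regularization.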
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Towards Multimodal Simultaneous Neural Machine Translation [28.536262015508722]
Simultaneous translation involves translating a sentence before the speaker's utterance is completed in order to realize real-time understanding.
This task is significantly more challenging than the general full sentence translation because of the shortage of input information during decoding.
We propose multimodal simultaneous neural machine translation (MSNMT), which leverages visual information as an additional modality.
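A minimal sketch of injecting a global image feature into the decoder states of a simultaneous model; the 2048-dim image feature and the additive fusion are assumptions for illustration, not MSNMT's exact design.

```python
# Illustrative fusion of a global image feature into decoder states.
import torch.nn as nn

class VisualFusion(nn.Module):
    def __init__(self, dim=512, img_dim=2048):
        super().__init__()
        self.proj = nn.Linear(img_dim, dim)

    def forward(self, decoder_states, image_feat):
        """decoder_states: (batch, T, dim); image_feat: (batch, img_dim).

        The projected image feature is added to every decoder position,
        supplying visual context while source text is still incomplete.
        """
        return decoder_states + self.proj(image_feat).unsqueeze(1)
```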
arXiv Detail & Related papers (2020-04-07T08:02:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.