Rethinking and Improving Multi-task Learning for End-to-end Speech
Translation
- URL: http://arxiv.org/abs/2311.03810v1
- Date: Tue, 7 Nov 2023 08:48:46 GMT
- Authors: Yuhao Zhang, Chen Xu, Bei Li, Hao Chen, Tong Xiao, Chunliang Zhang,
Jingbo Zhu
- Abstract summary: We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Significant improvements in end-to-end speech translation (ST) have been
achieved through the application of multi-task learning. However, the extent to
which auxiliary tasks are consistent with the ST task, and how much this
approach truly helps, have not been thoroughly studied. In this paper, we
investigate the consistency between different tasks, considering different
times and modules. We find that the textual encoder primarily facilitates
cross-modal conversion, but the presence of noise in speech impedes the
consistency between text and speech representations. Furthermore, we propose an
improved multi-task learning (IMTL) approach for the ST task, which bridges the
modal gap by mitigating the difference in length and representation. We conduct
experiments on the MuST-C dataset. The results demonstrate that our method
attains state-of-the-art results. Moreover, when additional data is used, we
achieve a new SOTA result on the MuST-C English-to-Spanish task with only 20.8%
of the training time required by the current SOTA method.
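The abstract says IMTL bridges the modal gap by "mitigating the difference in length" between speech and text representations, but does not describe the mechanism. As a purely illustrative sketch (the pooling scheme below is an assumption, not the paper's method), a long speech feature sequence can be shrunk toward the length of its text counterpart by adaptive average pooling:

```python
# Hypothetical sketch: shrink a long speech feature sequence toward the
# length of its paired text sequence via adaptive average pooling.
# This only illustrates the general idea of mitigating the length gap;
# the actual IMTL mechanism is not specified in the abstract.

def adaptive_pool(frames, target_len):
    """Average-pool a list of feature vectors down to target_len vectors."""
    n = len(frames)
    dim = len(frames[0])
    pooled = []
    for i in range(target_len):
        # Each output position averages a contiguous window of input frames.
        start = i * n // target_len
        end = max(start + 1, (i + 1) * n // target_len)
        window = frames[start:end]
        pooled.append([sum(v[d] for v in window) / len(window)
                       for d in range(dim)])
    return pooled

# A 10-frame "speech" sequence pooled to 4 "token" positions.
speech = [[float(i)] for i in range(10)]
shrunk = adaptive_pool(speech, target_len=4)
```

After pooling, speech and text sequences have comparable lengths, which makes position-wise comparison of their representations (and any consistency loss between them) straightforward.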
Related papers
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction [60.281405999483]
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
arXiv Detail & Related papers (2024-04-22T17:55:07Z)
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
The Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT outperforms single model counterparts consistently.
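The TMT summary describes tokenizing speech and image data into discrete tokens so that every modality looks like a "language" to one model. A minimal sketch of that idea (the codebook, quantizer, and tag tokens below are invented for illustration, not TMT's actual design) might look like:

```python
# Hypothetical sketch of "modalities as languages": continuous features are
# quantized into discrete tokens from a modality-specific codebook, and a
# tag token marks the modality, as language tags do in multilingual MT.
# Codebook and features are toy values, not from the TMT paper.

def quantize(values, codebook):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            for v in values]

speech_codebook = [0.0, 0.5, 1.0]
speech_features = [0.1, 0.45, 0.9]
speech_tokens = quantize(speech_features, speech_codebook)

# The unified interface: a tagged discrete-token sequence, whatever the
# source modality was.
sequence = ["<speech>"] + [f"s{t}" for t in speech_tokens]
```

Once every modality is a tagged token sequence, a single sequence-to-sequence model can in principle be trained on any modality pair.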
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Understanding and Bridging the Modality Gap for Speech Translation [11.13240570688547]
Multi-task learning is one of the effective ways to share knowledge between machine translation (MT) and end-to-end speech translation (ST).
However, due to the differences between speech and text, there is always a gap between ST and MT.
In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias.
arXiv Detail & Related papers (2023-05-15T15:09:18Z)
- Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5 [50.574918785575655]
We compare sequential fine-tuning with a multi-task learning model in the context of boosting performance on two tasks.
Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting.
arXiv Detail & Related papers (2022-10-31T13:26:08Z)
- Scheduled Multi-task Learning for Neural Chat Translation [66.81525961469494]
We propose a scheduled multi-task learning framework for Neural Chat Translation (NCT).
Specifically, we devise a three-stage training framework to incorporate the large-scale in-domain chat translation data into training.
Extensive experiments in four language directions verify the effectiveness and superiority of the proposed approach.
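The summary mentions a three-stage training framework that incorporates large-scale in-domain data. As an illustration only (the stage names, step boundaries, and task mix below are invented, not the paper's actual schedule), a staged multi-task schedule can be expressed as a step-to-tasks mapping:

```python
# Hypothetical sketch of a three-stage multi-task schedule: the set of
# active tasks changes as training progresses, moving from general-domain
# data toward in-domain chat translation data. Stage boundaries and task
# names are invented for illustration.

STAGES = [
    (0, ["general_mt"]),                   # stage 1: general-domain MT only
    (10_000, ["general_mt", "chat_mt"]),   # stage 2: mix in in-domain data
    (20_000, ["chat_mt"]),                 # stage 3: in-domain fine-tuning
]

def active_tasks(step):
    """Return the task list of the latest stage whose start step <= step."""
    tasks = STAGES[0][1]
    for start, stage_tasks in STAGES:
        if step >= start:
            tasks = stage_tasks
    return tasks
```

A training loop would call `active_tasks(step)` each iteration and sample batches only from the returned tasks.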
arXiv Detail & Related papers (2022-05-08T02:57:28Z)
- Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis [87.75833205560406]
This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system.
It does not require pooled data from all languages altogether, and thus alleviates the storage and computation burden.
arXiv Detail & Related papers (2021-10-09T07:00:38Z)
- Improving Speech Translation by Understanding and Learning from the Auxiliary Text Translation Task [26.703809355057224]
We conduct a detailed analysis to understand the impact of the auxiliary task on the primary task within the multitask learning framework.
Our analysis confirms that multitask learning tends to generate similar decoder representations from different modalities.
Inspired by these findings, we propose three methods to improve translation quality.
arXiv Detail & Related papers (2021-07-12T23:53:40Z)
- Learning Shared Semantic Space for Speech-to-Text Translation [32.12445734213848]
We propose Chimera, which bridges the modality gap between text machine translation (MT) and end-to-end speech translation (ST).
By projecting audio and text features to a common semantic representation, Chimera unifies MT and ST tasks.
Specifically, Chimera obtains 26.3 BLEU on EN-DE, improving the SOTA by a +2.7 BLEU margin.
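The summary says Chimera projects audio and text features into a common semantic representation. As a toy sketch of that idea (the projection matrices, feature values, and cosine alignment score are all invented here; Chimera's actual architecture is more involved), two modality-specific projections can map features of different dimensionality into one shared space:

```python
import math

# Hypothetical sketch of projecting two modalities into a shared semantic
# space: separate linear projections map audio and text features to a
# common dimension, where an alignment measure (here, cosine similarity)
# can be used to pull paired representations together during training.
# All matrices and features are toy values, not from the Chimera paper.

def project(vec, matrix):
    """Apply a linear projection (matrix rows are output dimensions)."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

audio_feat = [1.0, 0.0, 0.0]                      # 3-dim "audio" feature
text_feat = [0.0, 1.0]                            # 2-dim "text" feature
audio_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # projects 3 -> 2
text_proj = [[0.0, 1.0], [1.0, 0.0]]              # projects 2 -> 2

shared_audio = project(audio_feat, audio_proj)
shared_text = project(text_feat, text_proj)
alignment = cosine(shared_audio, shared_text)     # 1.0 when fully aligned
```

With both modalities in one space, the same downstream decoder can serve MT and ST, which is what lets the two tasks share knowledge.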
arXiv Detail & Related papers (2021-05-07T07:49:56Z)
- A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks [36.216979991706594]
We propose a general multi-task learning framework to leverage text data for automatic speech recognition (ASR) and speech translation (ST) tasks.
We demonstrate that representing text input as phoneme sequences can reduce the difference between speech and text inputs, and enhance the knowledge transfer from text corpora to the speech to text tasks.
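The summary above argues that representing text input as phoneme sequences reduces the difference between speech and text inputs. A minimal sketch of that conversion (the toy lexicon below stands in for a real pronunciation dictionary or G2P model, which the paper would use in practice):

```python
# Hypothetical sketch: converting text to phoneme sequences narrows the
# gap to speech input, since both then describe sounds rather than
# spellings. The lexicon is a toy stand-in for a real grapheme-to-phoneme
# resource such as a pronunciation dictionary.

LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "to": ["T", "UW"],
    "text": ["T", "EH", "K", "S", "T"],
}

def to_phonemes(sentence):
    """Convert a whitespace-tokenized sentence to a flat phoneme sequence."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(LEXICON.get(word, ["<unk>"]))
    return phonemes

phones = to_phonemes("speech to text")
```

The phoneme sequence is closer in granularity and content to the acoustic input than raw subwords are, which is the intuition behind the knowledge-transfer claim.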
arXiv Detail & Related papers (2020-10-21T22:40:43Z)
- Hierarchical Multi Task Learning with Subword Contextual Embeddings for Languages with Rich Morphology [5.5217350574838875]
Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP).
We propose using subword contextual embeddings to capture morphological information for languages with rich morphology.
Our model outperforms previous state-of-the-art models on both tasks for the Turkish language.
arXiv Detail & Related papers (2020-04-25T22:55:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.