Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
- URL: http://arxiv.org/abs/2509.09701v1
- Date: Thu, 04 Sep 2025 17:21:36 GMT
- Title: Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
- Authors: JungHo Jung, Junhyun Lee
- Abstract summary: We formulate Multi-Task Learning (MTL) from a regularization perspective. We show how consistency regularization and R-drop contribute to the total regularization. We introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon.
- Score: 4.714127708213542
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset.
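The three regularization sources named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the additive form below, the function names, and the weights `mt_weight`, `rdrop_weight`, and `consistency_weight` are all assumptions made for illustration. It treats a single target token, takes R-drop to be a symmetric KL term between two dropout passes of the same modality, and takes consistency regularization to be a symmetric KL term between the speech (ST) and text (MT) predictions.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index."""
    return -math.log(probs[target])


def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sym_kl(p, q):
    """Symmetric KL divergence, averaged over both directions."""
    return 0.5 * (kl(p, q) + kl(q, p))


def mtl_loss(st_a, st_b, mt_a, mt_b, target,
             mt_weight, rdrop_weight, consistency_weight):
    """Hypothetical combined loss for one target token.

    st_a, st_b: predictions from two dropout passes on the speech input.
    mt_a, mt_b: predictions from two dropout passes on the text input.
    All three weights are illustrative hyperparameters, not the paper's.
    """
    st_ce = 0.5 * (cross_entropy(st_a, target) + cross_entropy(st_b, target))
    mt_ce = 0.5 * (cross_entropy(mt_a, target) + cross_entropy(mt_b, target))
    # Same-modality regularization (R-drop style).
    rdrop = sym_kl(st_a, st_b) + sym_kl(mt_a, mt_b)
    # Cross-modality regularization (speech vs. text predictions).
    consistency = sym_kl(st_a, mt_a) + sym_kl(st_b, mt_b)
    return (st_ce + mt_weight * mt_ce
            + rdrop_weight * rdrop
            + consistency_weight * consistency)


if __name__ == "__main__":
    st_a = softmax([2.0, 0.5, -1.0])
    st_b = softmax([1.8, 0.7, -0.9])
    mt_a = softmax([2.5, 0.1, -1.2])
    mt_b = softmax([2.4, 0.2, -1.1])
    print(mtl_loss(st_a, st_b, mt_a, mt_b, target=0,
                   mt_weight=0.5, rdrop_weight=1.0, consistency_weight=1.0))
```

Under these assumptions, tuning "within the regularization horizon" would correspond to choosing the three weights jointly rather than one at a time, since each one changes the total amount of regularization.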
Related papers
- DiBS-MTL: Transformation-Invariant Multitask Learning with Direction Oracles [20.925878778939083]
Multitask learning (MTL) algorithms typically rely on schemes that combine different task losses or their gradients through weighted averaging. In doing so, a central challenge arises because task losses can be arbitrarily scaled. We show that the convergence behavior of DiBS in non-MTL settings is not understood.
arXiv Detail & Related papers (2025-09-28T15:57:06Z)
- Traj-MLLM: Can Multimodal Large Language Models Reform Trajectory Data Mining? [16.718696916767428]
We propose Traj-MLLM, which is the first general framework using MLLMs for trajectory data mining. Traj-MLLM transforms raw trajectories into interleaved image-text sequences while preserving key spatial-temporal characteristics. Experiments on four publicly available datasets show that Traj-MLLM outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-25T06:45:34Z)
- Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models [46.76139085979338]
OTReg is a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures.
arXiv Detail & Related papers (2025-08-11T16:06:04Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose an ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Understanding and Bridging the Modality Gap for Speech Translation [11.13240570688547]
Multi-task learning is an effective way to share knowledge between machine translation (MT) and end-to-end speech translation (ST).
However, due to the differences between speech and text, there is always a gap between ST and MT.
In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias.
arXiv Detail & Related papers (2023-05-15T15:09:18Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the cross-modal representation discrepancy between speech and text.
Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
arXiv Detail & Related papers (2022-03-20T01:49:53Z)
- Gaussian Multi-head Attention for Simultaneous Machine Translation [21.03142288187605]
Simultaneous machine translation (SiMT) outputs translation while receiving the streaming source inputs.
We propose a new SiMT policy by modeling alignment and translation in a unified manner.
Experiments on En-Vi and De-En tasks show that our method outperforms strong baselines on the trade-off between translation and latency.
arXiv Detail & Related papers (2022-03-17T04:01:25Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.