Improving Speech Translation by Cross-Modal Multi-Grained Contrastive
Learning
- URL: http://arxiv.org/abs/2304.10309v1
- Date: Thu, 20 Apr 2023 13:41:56 GMT
- Title: Improving Speech Translation by Cross-Modal Multi-Grained Contrastive
Learning
- Authors: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu,
and Wei-Qiang Zhang
- Abstract summary: We propose the FCCL (Fine- and Coarse- Granularity Contrastive Learning) approach for E2E-ST.
A key ingredient of our approach is applying contrastive learning at both sentence- and frame-level to provide comprehensive guidance for extracting speech representations containing rich semantic information.
Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs.
- Score: 8.501945512734268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The end-to-end speech translation (E2E-ST) model has gradually become a
mainstream paradigm due to its low latency and reduced error propagation. However,
it is non-trivial to train such a model well due to the task complexity and
data scarcity. Because of the differences between the speech and text modalities,
the E2E-ST model usually performs worse than the corresponding machine translation
(MT) model. Based on this observation, existing methods often use sharing
mechanisms to carry out implicit knowledge transfer by imposing various
constraints. However, the final model often performs worse on the MT task than
the MT model trained alone, which means that the knowledge transfer ability of
these methods is also limited. To deal with these problems, we propose the FCCL
(Fine- and Coarse-Granularity Contrastive Learning) approach for E2E-ST, which
performs explicit knowledge transfer through cross-modal multi-grained contrastive
learning. A key ingredient of our approach is applying contrastive learning at
both sentence- and frame-level to provide comprehensive guidance for extracting
speech representations containing rich semantic information. In addition, we
adopt a simple whitening method to alleviate the representation degeneration in
the MT model, which adversely affects contrastive learning. Experiments on the
MuST-C benchmark show that our proposed approach significantly outperforms the
state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis
indicates that FCCL can free up its capacity from learning grammatical
structure information and force more layers to learn semantic information.
Related papers
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z) - Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model [27.56988000960972]
We introduce a new framework based on a dual context of both domain-shared and class-specific contexts.
Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in Large Language Models.
We also formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens.
arXiv Detail & Related papers (2024-07-05T13:15:29Z) - MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation [61.65537912700187]
Large Language Models (LLMs) have demonstrated strong ability in the field of machine translation (MT).
We propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner.
arXiv Detail & Related papers (2024-03-14T16:07:39Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - Understanding and Bridging the Modality Gap for Speech Translation [11.13240570688547]
Multi-task learning is one of the effective ways to share knowledge between machine translation (MT) and end-to-end speech translation (ST).
However, due to the differences between speech and text, there is always a gap between ST and MT.
In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias.
arXiv Detail & Related papers (2023-05-15T15:09:18Z) - Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP)
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language
Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Tight Integrated End-to-End Training for Cascaded Speech Translation [40.76367623739673]
A cascaded speech translation model relies on discrete and non-differentiable transcription.
Direct speech translation is an alternative method to avoid error propagation.
This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model.
arXiv Detail & Related papers (2020-11-24T15:43:49Z) - Towards Accurate Knowledge Transfer via Target-awareness Representation
Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED)
TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer when fine-tuning the target model.
Experiments on various real-world datasets show that our method stably improves standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)