CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for
Multimodal Machine Translation
- URL: http://arxiv.org/abs/2308.15226v1
- Date: Tue, 29 Aug 2023 11:29:43 GMT
- Title: CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for
Multimodal Machine Translation
- Authors: Devaansh Gupta, Siddhant Kharbanda, Jiawei Zhou, Wanhua Li, Hanspeter
Pfister, Donglai Wei
- Abstract summary: Multimodal machine translation (MMT) systems enhance neural machine translation (NMT) with visual knowledge.
Previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data.
We propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART.
- Score: 31.911593690549633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a growing interest in developing multimodal machine
translation (MMT) systems that enhance neural machine translation (NMT) with
visual knowledge. This problem setup involves using images as auxiliary
information during training, and more recently, eliminating their use during
inference. Towards this end, previous works face a challenge in training
powerful MMT models from scratch due to the scarcity of annotated multilingual
vision-language data, especially for low-resource languages. Simultaneously,
there has been an influx of multilingual pre-trained models for NMT and
multimodal pre-trained models for vision-language tasks, primarily in English,
which have shown exceptional generalisation ability. However, these are not
directly applicable to MMT since they do not provide aligned multimodal
multilingual features for generative tasks. To alleviate this issue, instead of
designing complex modules for MMT, we propose CLIPTrans, which simply adapts
the independently pre-trained multimodal M-CLIP and the multilingual mBART. In
order to align their embedding spaces, mBART is conditioned on the M-CLIP
features by a prefix sequence generated through a lightweight mapping network.
We train this in a two-stage pipeline which warms up the model with image
captioning before the actual translation task. Through experiments, we
demonstrate the merits of this framework and consequently push forward the
state-of-the-art across standard benchmarks by an average of +2.67 BLEU. The
code can be found at www.github.com/devaansh100/CLIPTrans.
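To make the abstract's core mechanism concrete, below is a minimal PyTorch sketch of prefix conditioning: a frozen M-CLIP embedding is mapped by a lightweight network to a short prefix of mBART-sized vectors and prepended to mBART's input embeddings. This is not the authors' code; the two-layer MLP mapper, the prefix length of 10, the 768-d stand-in for the M-CLIP embedding, and the mBART-50 checkpoint are illustrative assumptions.
```python
# Minimal sketch of prefix conditioning as described in the CLIPTrans abstract.
# Mapper architecture, prefix length, and dimensions are assumptions, not the paper's values.
import torch
import torch.nn as nn
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

class MappingNetwork(nn.Module):
    """Maps one M-CLIP embedding to a short prefix of mBART-sized vectors."""
    def __init__(self, clip_dim=768, mbart_dim=1024, prefix_len=10):
        super().__init__()
        self.prefix_len, self.mbart_dim = prefix_len, mbart_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, mbart_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(mbart_dim * prefix_len, mbart_dim * prefix_len),
        )

    def forward(self, clip_emb):                      # (batch, clip_dim)
        prefix = self.mlp(clip_emb)                   # (batch, prefix_len * mbart_dim)
        return prefix.view(-1, self.prefix_len, self.mbart_dim)

mbart = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="de_DE")
mapper = MappingNetwork()

src = tok("A dog runs on the beach.", return_tensors="pt")
clip_emb = torch.randn(1, 768)   # stand-in for a frozen M-CLIP image/text embedding

# Prepend the mapped prefix to mBART's own token embeddings on the encoder side.
tok_embeds = mbart.get_input_embeddings()(src.input_ids)   # (1, seq, 1024)
prefix = mapper(clip_emb)                                   # (1, 10, 1024)
inputs_embeds = torch.cat([prefix, tok_embeds], dim=1)
attn_mask = torch.cat(
    [torch.ones(1, prefix.shape[1], dtype=src.attention_mask.dtype),
     src.attention_mask], dim=1)

labels = tok(text_target="Ein Hund läuft am Strand.", return_tensors="pt").input_ids
loss = mbart(inputs_embeds=inputs_embeds, attention_mask=attn_mask, labels=labels).loss
```
Per the abstract's two-stage pipeline, the same conditioning would first be warmed up on image captioning before the targets are switched to translations; only the inputs and references change between stages.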
Related papers
- EMMeTT: Efficient Multimodal Machine Translation Training [26.295981183965566]
We propose a joint multimodal training regime of Speech-LLM to include automatic speech translation (AST)
To handle joint multimodal training, we propose a novel training framework called EMMeTT.
The resultant Multimodal Translation Model produces strong text and speech translation results at the same time.
arXiv Detail & Related papers (2024-09-20T14:03:23Z)
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation [13.426403221815063]
This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
arXiv Detail & Related papers (2023-06-12T15:56:10Z)
- Building Multilingual Machine Translation Systems That Serve Arbitrary X-Y Translations [75.73028056136778]
We show how to practically build MNMT systems that serve arbitrary X-Y translation directions.
We also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.
arXiv Detail & Related papers (2022-06-30T02:18:15Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Dynamic Context-guided Capsule Network for Multimodal Machine Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
arXiv Detail & Related papers (2020-09-04T06:18:24Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
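As a concrete reference for the mBART entry above, the following toy example runs the publicly released facebook/mbart-large-50 checkpoint as a sequence-to-sequence denoising model: a corrupted sentence goes in, the clean sentence is the reconstruction target. The masked-span input is an invented simplification of mBART's actual noise function, not taken from the paper.
```python
# Toy illustration of mBART-style denoising (corrupted text in, clean text out).
# The single <mask> span is a simplification of mBART's span-masking noise function.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tok = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="en_XX")

noisy = "The quick brown fox <mask> the lazy dog."      # corrupted input
clean = "The quick brown fox jumps over the lazy dog."  # reconstruction target

batch = tok(noisy, text_target=clean, return_tensors="pt")
loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=batch.labels).loss   # denoising objective: reconstruct the clean text
```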