A Survey of Vision-Language Pre-training from the Lens of Multimodal
Machine Translation
- URL: http://arxiv.org/abs/2306.07198v1
- Date: Mon, 12 Jun 2023 15:56:10 GMT
- Title: A Survey of Vision-Language Pre-training from the Lens of Multimodal
Machine Translation
- Authors: Jeremy Gwinnup and Kevin Duh
- Abstract summary: This paper surveys the landscape of language-and-vision pre-training from the lens of multimodal machine translation.
We summarize the common architectures, pre-training objectives, and datasets from literature and conjecture what further is needed to make progress on multimodal machine translation.
- Score: 13.426403221815063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models such as BERT and the GPT series started a paradigm
shift that calls for building general-purpose models via pre-training on large
datasets, followed by fine-tuning on task-specific datasets. There is now a
plethora of large pre-trained models for Natural Language Processing and
Computer Vision. Recently, we have seen rapid developments in the joint
Vision-Language space as well, where pre-trained models such as CLIP (Radford
et al., 2021) have demonstrated improvements in downstream tasks like image
captioning and visual question answering. However, surprisingly there is
comparatively little work on exploring these models for the task of multimodal
machine translation, where the goal is to leverage image/video modality in
text-to-text translation. To fill this gap, this paper surveys the landscape of
language-and-vision pre-training from the lens of multimodal machine
translation. We summarize the common architectures, pre-training objectives,
and datasets from literature and conjecture what further is needed to make
progress on multimodal machine translation.
Related papers
- CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for
Multimodal Machine Translation [31.911593690549633]
multimodal machine translation (MMT) systems enhance neural machine translation (NMT) with visual knowledge.
Previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data.
We propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART.
arXiv Detail & Related papers (2023-08-29T11:29:43Z) - RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z) - Vision-and-Language Pretraining [19.903012955284698]
This article provides a comprehensive revision of contemporary V&L pretraining models.
In particular, we categorize and delineate pretraining approaches, along with the summary of state-of-the-art vision-and-language pretrained models.
arXiv Detail & Related papers (2022-07-05T02:18:49Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based
Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - A Survey of Vision-Language Pre-Trained Models [41.323956143107644]
Pre-trained models have advanced at a breakneck pace in recent years.
How to adapt pre-training to the field of Vision-and-Language learning and improve the performance on downstream tasks becomes a focus of multimodal learning.
arXiv Detail & Related papers (2022-02-18T15:15:46Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - Cross-lingual Visual Pre-training for Multimodal Machine Translation [36.4592103797139]
We combine cross-lingual and visual pre-training methods to learn cross-lingual representations.
We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance.
arXiv Detail & Related papers (2021-01-25T12:46:41Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images
and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct mask-and-predict'' pre-training on text-only and image-only corpora.
We find that such a simple approach performance close to a model pre-trained with aligned data, on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.