M3P: Learning Universal Representations via Multitask Multilingual
Multimodal Pre-training
- URL: http://arxiv.org/abs/2006.02635v4
- Date: Thu, 1 Apr 2021 03:43:53 GMT
- Title: M3P: Learning Universal Representations via Multitask Multilingual
Multimodal Pre-training
- Authors: Minheng Ni, Haoyang Huang, Lin Su, Edward Cui, Taroon Bharti, Lijuan
Wang, Jianfeng Gao, Dongdong Zhang and Nan Duan
- Abstract summary: M3P is a Multitask Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
- Score: 119.16007395162431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present M3P, a Multitask Multilingual Multimodal Pre-trained model that
combines multilingual pre-training and multimodal pre-training into a unified
framework via multitask pre-training. Our goal is to learn universal
representations that can map objects occurring in different modalities or texts
expressed in different languages into a common semantic space. In addition, to
explicitly encourage fine-grained alignment between images and non-English
languages, we also propose Multimodal Code-switched Training (MCT) to combine
monolingual pre-training and multimodal pre-training via a code-switch
strategy. Experiments are performed on the multilingual image retrieval task
across two benchmark datasets, MSCOCO and Multi30K. M3P achieves
comparable results for English and new state-of-the-art results for non-English
languages.
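The code-switch strategy behind MCT can be pictured as replacing a fraction of the words in an English caption with dictionary translations before pairing the mixed-language text with the image. The Python sketch below illustrates that idea under stated assumptions: the bilingual dictionary, the switch ratio, and the whitespace tokenization are illustrative choices, not the paper's exact recipe.

```python
import random

def code_switch_caption(caption, bilingual_dict, switch_ratio=0.3, seed=None):
    """Replace a random subset of English words with dictionary translations,
    yielding a mixed-language caption to pair with the original image."""
    rng = random.Random(seed)
    words = []
    for word in caption.split():
        translations = bilingual_dict.get(word.lower())
        if translations and rng.random() < switch_ratio:
            words.append(rng.choice(translations))
        else:
            words.append(word)
    return " ".join(words)

# Illustrative English->German dictionary; switch_ratio=1.0 swaps every covered word.
en2de = {"dog": ["Hund"], "grass": ["Gras"]}
print(code_switch_caption("a dog runs on the grass", en2de, switch_ratio=1.0))
# -> "a Hund runs on the Gras"
```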
Related papers
- m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt [39.2728779674405]
We propose a framework that leverages multimodal prompts to guide Multimodal Multilingual neural Machine Translation (m3P).
Our method aims to minimize the representation distance between different languages by treating the image as a central language.
Experimental results show that m3P outperforms previous text-only baselines and multilingual multimodal methods by a large margin.
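One way to read "image as a central language" is a contrastive objective that pulls each language's caption representation toward the shared image representation. The PyTorch sketch below follows that reading; the symmetric InfoNCE form, temperature, and feature shapes are assumptions made here for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def image_pivot_alignment_loss(img_emb, text_embs, temperature=0.07):
    """Align each language's caption embeddings with the matching image
    embeddings via a symmetric InfoNCE loss, using the image as the pivot.

    img_emb:   (batch, dim) image features
    text_embs: dict mapping language code -> (batch, dim) caption features
    """
    img = F.normalize(img_emb, dim=-1)
    targets = torch.arange(img.size(0), device=img.device)
    loss = img.new_zeros(())
    for txt in text_embs.values():
        txt = F.normalize(txt, dim=-1)
        logits = img @ txt.t() / temperature  # pairwise image-text similarities
        loss = loss + 0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.t(), targets))
    return loss / len(text_embs)

# Toy usage with random features for two languages.
loss = image_pivot_alignment_loss(
    torch.randn(8, 512),
    {"en": torch.randn(8, 512), "de": torch.randn(8, 512)},
)
```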
arXiv Detail & Related papers (2024-03-26T10:04:24Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent advances still need to train a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model to the multilingual setting.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Large-scale Bilingual Language-Image Contrastive Learning [17.19890778916312]
We collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP.
We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation.
Experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages.
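Multi-crop augmentation, as commonly used in self-supervised vision training, produces a couple of large "global" views and several small "local" views per image. The torchvision sketch below shows the general idea; the crop sizes, scales, and counts are illustrative assumptions and may differ from KELIP's actual settings.

```python
from torchvision import transforms

class MultiCrop:
    """Return two global crops and several small local crops per image."""
    def __init__(self, global_size=224, local_size=96, n_local=4):
        self.global_crop = transforms.Compose([
            transforms.RandomResizedCrop(global_size, scale=(0.4, 1.0)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])
        self.local_crop = transforms.Compose([
            transforms.RandomResizedCrop(local_size, scale=(0.05, 0.4)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])
        self.n_local = n_local

    def __call__(self, img):
        # Returns a list of tensors: 2 global views followed by n_local local views.
        return ([self.global_crop(img) for _ in range(2)]
                + [self.local_crop(img) for _ in range(self.n_local)])
```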
arXiv Detail & Related papers (2022-03-28T03:02:03Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
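Adapter-based adaptation typically inserts small bottleneck modules into a frozen transformer and trains only those modules for a new language. The PyTorch module below is a generic sketch of that idea; the hidden and bottleneck sizes are illustrative and not xGQA's exact configuration.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: down-project, non-linearity, up-project,
    plus a residual connection around the whole module."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Only the small adapter weights are trained for a new language;
        # the backbone transformer layers stay frozen.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```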
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)