Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual
Transfer of Vision-Language Models
- URL: http://arxiv.org/abs/2103.08849v2
- Date: Thu, 18 Mar 2021 17:40:09 GMT
- Title: Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual
Transfer of Vision-Language Models
- Authors: Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze
and Alexander Hauptmann
- Abstract summary: We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
- Score: 144.85290716246533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies zero-shot cross-lingual transfer of vision-language
models. Specifically, we focus on multilingual text-to-video search and propose
a Transformer-based model that learns contextualized multilingual multimodal
embeddings. Under a zero-shot setting, we empirically demonstrate that
performance degrades significantly when we query the multilingual text-video
model with non-English sentences. To address this problem, we introduce a
multilingual multimodal pre-training strategy, and collect a new multilingual
instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on
VTT show that our method significantly improves video search in non-English
languages without additional annotations. Furthermore, when multilingual
annotations are available, our method outperforms recent baselines by a large
margin in multilingual text-to-video search on VTT and VATEX, as well as in
multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are
available at http://github.com/berniebear/Multi-HT100M.
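As a concrete illustration of the shared multilingual text-video embedding space described above, the sketch below projects pooled sentence and video features into one space and trains them with a symmetric contrastive objective, the general recipe behind such text-to-video search models. The module names, feature dimensions, and simple linear projections are assumptions for illustration only; they are not the released implementation (see the repository above for that).

```python
# Minimal sketch (not the released code): align pooled multilingual sentence
# features and pooled video features in a joint space with a symmetric
# InfoNCE loss so that matching (caption, video) pairs score highest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVideoEmbedder(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)    # placeholder head over a multilingual text encoder
        self.video_proj = nn.Linear(video_dim, joint_dim)  # placeholder head over pooled video features

    def forward(self, text_feats, video_feats):
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return t, v

def contrastive_loss(t, v, temperature=0.07):
    # Every caption is scored against every video in the batch;
    # matching pairs lie on the diagonal of the similarity matrix.
    logits = t @ v.t() / temperature
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random pooled features for a batch of 8 (caption, video) pairs.
model = TextVideoEmbedder()
t, v = model(torch.randn(8, 768), torch.randn(8, 1024))
loss = contrastive_loss(t, v)
```

At search time, a query sentence in any supported language is embedded once and ranked against pre-computed video embeddings by cosine similarity.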
Related papers
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
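As an illustration of the curriculum described above, the snippet below ramps the share of non-English examples from 30% to 60% over the course of pre-training. The linear schedule and the sampling interface are assumptions; the paper only specifies the start and end proportions.

```python
import random

def non_english_ratio(progress, start=0.30, end=0.60):
    # progress in [0, 1]; the linear ramp is an assumption, only the endpoints
    # (30% early, 60% at the end of pre-training) come from the paper.
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress

def sample_batch(english_pool, multilingual_pool, batch_size, progress):
    # Mix English and non-English examples according to the current schedule.
    n_multi = round(batch_size * non_english_ratio(progress))
    return (random.choices(multilingual_pool, k=n_multi)
            + random.choices(english_pool, k=batch_size - n_multi))

# Toy usage: halfway through training, roughly 45% of the batch is non-English.
english_pool = ["an English sentence"] * 100
multilingual_pool = ["a non-English sentence"] * 100
batch = sample_batch(english_pool, multilingual_pool, batch_size=32, progress=0.5)
```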
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing unreliable translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval [39.41224716332499]
We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval.
Inspired by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model that takes text in different languages to match the cross-modal predictions of a teacher model that takes English text.
We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages.
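A hedged sketch of the distillation signal: the teacher scores a batch of videos against English captions, and the student, which sees the same captions in another language, is pushed to match the teacher's text-to-video similarity distribution. Embedding shapes and the temperature are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_lingual_distillation_loss(student_text, teacher_text, video_embs, temperature=0.05):
    # KL divergence between the student's and the teacher's distributions
    # over candidate videos for each caption in the batch.
    teacher_probs = F.softmax(teacher_text @ video_embs.t() / temperature, dim=-1).detach()
    student_logp = F.log_softmax(student_text @ video_embs.t() / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")

# Toy usage: 4 captions (English for the teacher, translated for the student)
# scored against the same 4 videos in a shared 256-d embedding space.
videos = F.normalize(torch.randn(4, 256), dim=-1)
teacher_text = F.normalize(torch.randn(4, 256), dim=-1)
student_text = F.normalize(torch.randn(4, 256), dim=-1)
loss = cross_lingual_distillation_loss(student_text, teacher_text, videos)
```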
arXiv Detail & Related papers (2022-10-07T15:30:24Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
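The first step, constructing pseudo ground-truth multilingual visual-text pairs, amounts to pairing each visual item with machine translations of its English caption. The sketch below assumes a generic `translate(text, src, tgt)` function and a simple record layout; it is not MuMUR's pipeline.

```python
# Hypothetical sketch: expand an English-captioned dataset into multilingual
# pseudo ground-truth pairs using any machine translation system.
def build_pseudo_pairs(dataset, target_languages, translate):
    pairs = []
    for item in dataset:  # item: {"visual_id": ..., "caption_en": ...}
        pairs.append({"visual_id": item["visual_id"],
                      "caption": item["caption_en"], "lang": "en"})
        for lang in target_languages:
            pairs.append({"visual_id": item["visual_id"],
                          "caption": translate(item["caption_en"], "en", lang),
                          "lang": lang})
    return pairs
```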
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
- Towards Zero-shot Cross-lingual Image Retrieval and Tagging [1.4425878137951236]
We present a zero-shot approach for learning multi-modal representations using cross-lingual pre-training on the text side.
We introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform.
We also demonstrate how a cross-lingual model can be used for downstream tasks like multi-lingual image tagging in a zero-shot manner.
arXiv Detail & Related papers (2021-09-15T23:39:15Z)
- xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
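The adapter-based recipe typically inserts small bottleneck modules into an otherwise frozen multimodal transformer and trains one adapter per target language. The module below is a generic bottleneck adapter in that spirit, not the xGQA implementation; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Down-project, non-linearity, up-project, residual connection.
    # One adapter per language can be trained while the backbone stays frozen.
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Toy usage: adapt the output of one frozen transformer layer.
adapted = BottleneckAdapter()(torch.randn(2, 16, 768))  # (batch, sequence, hidden)
```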
arXiv Detail & Related papers (2021-09-13T15:58:21Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
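Roughly, random online backtranslation creates training examples for unseen directions by back-translating each target sentence into a randomly sampled language with the current model during training. The sketch below uses a hypothetical `model.translate(...)` interface to show the loop; it is not the paper's implementation.

```python
import random

def random_online_backtranslation(batch, model, languages):
    # For each (source, target) example, pick a random pivot language and
    # back-translate the target sentence into it with the current model,
    # yielding a synthetic pair for an otherwise unseen direction.
    # `model.translate(text, src_lang, tgt_lang)` is a hypothetical interface.
    synthetic = []
    for ex in batch:
        pivot = random.choice([l for l in languages if l != ex["tgt_lang"]])
        bt = model.translate(ex["tgt_text"], src_lang=ex["tgt_lang"], tgt_lang=pivot)
        synthetic.append({"src_text": bt, "src_lang": pivot,
                          "tgt_text": ex["tgt_text"], "tgt_lang": ex["tgt_lang"]})
    return synthetic
```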
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.