Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
- URL: http://arxiv.org/abs/2306.16774v1
- Date: Thu, 29 Jun 2023 08:20:57 GMT
- Title: Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
- Authors: Yasmine Karoui, Rémi Lebret, Negar Foroutan, Karl Aberer
- Abstract summary: We propose a simple yet efficient approach to adapt Vision-Language Pre-training to unseen languages using MPLM.
Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data.
- Score: 3.3227703089509304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Pre-training (VLP) has advanced the performance of many
vision-language tasks, such as image-text retrieval, visual entailment, and
visual reasoning. The pre-training mostly utilizes lexical databases and image
queries in English. Previous work has demonstrated that the pre-training in
English does not transfer well to other languages in a zero-shot setting.
However, multilingual pre-trained language models (MPLM) have excelled at a
variety of single-modal language tasks. In this paper, we propose a simple yet
efficient approach to adapt VLP to unseen languages using MPLM. We utilize a
cross-lingual contextualized token embeddings alignment approach to train text
encoders for non-English languages. Our approach does not require image input
and primarily uses machine translation, eliminating the need for target
language data. Our evaluation across three distinct tasks (image-text
retrieval, visual entailment, and natural language visual reasoning)
demonstrates that this approach outperforms the state-of-the-art multilingual
vision-language models without requiring large parallel corpora. Our code is
available at https://github.com/Yasminekaroui/CliCoTea.
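The core of the proposed adaptation is cross-lingual contextualized token embedding alignment: English captions are machine-translated into the target language, and a multilingual text encoder is trained so that its token embeddings for the translation match the embeddings the English VLP text encoder produces for the aligned English tokens. Below is a minimal sketch of that idea; the specific model names, the MSE objective, freezing the English encoder, and the pre-computed token alignments are illustrative assumptions rather than the paper's exact recipe (see the CliCoTea repository above for the actual implementation).

```python
# Hedged sketch: align a multilingual "student" text encoder with a frozen
# English "teacher" encoder on machine-translated sentence pairs.
# Model names and the alignment source are assumptions for illustration.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

teacher_name = "bert-base-uncased"              # stand-in for the VLP model's English text encoder
student_name = "bert-base-multilingual-cased"   # multilingual encoder to adapt

teacher = AutoModel.from_pretrained(teacher_name).eval()
student = AutoModel.from_pretrained(student_name)
tok_en = AutoTokenizer.from_pretrained(teacher_name)
tok_ml = AutoTokenizer.from_pretrained(student_name)

for p in teacher.parameters():                  # assumed: English encoder stays frozen
    p.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
mse = nn.MSELoss()

def alignment_step(en_sentence, translated_sentence, aligned_token_pairs):
    """One training step on a single (English, machine-translated) pair.

    aligned_token_pairs: list of (teacher_token_idx, student_token_idx) tuples,
    assumed to come from an external word-alignment tool run on the parallel
    sentences (hypothetical pre-processing, not shown here).
    """
    en_inputs = tok_en(en_sentence, return_tensors="pt")
    ml_inputs = tok_ml(translated_sentence, return_tensors="pt")

    with torch.no_grad():
        teacher_states = teacher(**en_inputs).last_hidden_state[0]   # (len_en, hidden)
    student_states = student(**ml_inputs).last_hidden_state[0]       # (len_ml, hidden)

    t_idx = torch.tensor([i for i, _ in aligned_token_pairs])
    s_idx = torch.tensor([j for _, j in aligned_token_pairs])
    loss = mse(student_states[s_idx], teacher_states[t_idx])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Both encoders in this sketch share a 768-dimensional hidden size, which is what makes the direct MSE between token states well-defined; with mismatched dimensions a projection layer would be needed. Note that no image input is involved, matching the abstract's claim that only machine-translated text is required.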
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.