RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
- URL: http://arxiv.org/abs/2305.07927v1
- Date: Sat, 13 May 2023 14:41:05 GMT
- Title: RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
- Authors: Chulun Zhou, Yunlong Liang, Fandong Meng, Jinan Xu, Jinsong Su and Jie
Zhou
- Abstract summary: We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
- Score: 84.23022072347821
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multilingual vision-language (V&L) pre-training has achieved remarkable
progress in learning universal representations across different modalities and
languages. In spite of recent success, there still remain challenges limiting
further improvements of V&L pre-trained models in multilingual settings.
Particularly, current V&L pre-training methods rely heavily on strictly-aligned
multilingual image-text pairs generated from English-centric datasets through
machine translation. However, the cost of collecting and translating such
strictly-aligned datasets is usually prohibitive. In this paper, we propose
Regularized Contrastive Cross-lingual Cross-modal (RC^3) pre-training, which
further exploits more abundant weakly-aligned multilingual image-text pairs.
Specifically, we design a regularized cross-lingual visio-textual contrastive
learning objective that constrains the representation proximity of
weakly-aligned visio-textual inputs according to textual relevance. Besides,
existing V&L pre-training approaches mainly deal with visual inputs by either
region-of-interest (ROI) features or patch embeddings. We flexibly integrate
the two forms of visual features into our model for pre-training and downstream
multi-modal tasks. Extensive experiments on 5 downstream multi-modal tasks
across 6 languages demonstrate the effectiveness of our proposed method over
competitive baselines, while exhibiting stronger zero-shot capability.
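
To make the regularized objective more concrete, below is a minimal PyTorch-style sketch (not the authors' released code) of a symmetric image-text InfoNCE loss in which each weakly-aligned pair is constrained only in proportion to a textual-relevance score. The function name `regularized_vl_contrastive_loss`, the precomputed relevance weights, and all tensor shapes are illustrative assumptions rather than the exact RC^3 formulation.

```python
# Hedged sketch, not the authors' implementation: a symmetric image-text InfoNCE
# loss where weakly-aligned pairs are constrained only in proportion to a
# textual-relevance weight in [0, 1]. All names and shapes are assumptions.
import torch
import torch.nn.functional as F


def regularized_vl_contrastive_loss(img_emb, txt_emb, relevance, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of paired images and captions.
    relevance: (B,) scores; ~1.0 for strictly-aligned pairs, lower for
    weakly-aligned (e.g. web-crawled or loosely translated) pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                           # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    i2t = F.cross_entropy(logits, targets, reduction="none")       # image -> text
    t2i = F.cross_entropy(logits.t(), targets, reduction="none")   # text -> image
    per_pair = 0.5 * (i2t + t2i)
    # Regularization: noisy pairs contribute less, so the model is not forced
    # to pull weakly-aligned visio-textual inputs as close as clean ones.
    return (relevance * per_pair).mean()


# Toy usage with random features standing in for encoder outputs.
img_emb = torch.randn(8, 256)
txt_emb = torch.randn(8, 256)
relevance = torch.rand(8).clamp(min=0.1)
print(regularized_vl_contrastive_loss(img_emb, txt_emb, relevance).item())
```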
Related papers
- CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual
Knowledge Transfer [23.58317401302547]
We propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and target language using cross-lingual transfer.
We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX.
arXiv Detail & Related papers (2023-12-14T14:29:53Z)

VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
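
As an illustration of the token-level term described above, here is a minimal PyTorch-style sketch: a thesaurus-mined synonym pair acts as anchor and positive, while the remaining tokens of the bilingual instance serve as negatives. The function `token_alignment_loss` and its index-based interface are assumptions, not VECO 2.0's actual code; the sequence-to-sequence term would be the analogous InfoNCE over sentence-level representations of parallel pairs.

```python
# Hedged sketch of a token-to-token contrastive alignment in the spirit of the
# multi-granularity objective described above; names and shapes are illustrative.
import torch
import torch.nn.functional as F


def token_alignment_loss(hidden, anchor_idx, positive_idx, temperature=0.1):
    """hidden: (T, D) token states of one concatenated bilingual instance.
    anchor_idx, positive_idx: (P,) indices of thesaurus-mined synonym pairs."""
    h = F.normalize(hidden, dim=-1)
    anchors = h[anchor_idx]                       # (P, D) anchor-token states
    logits = anchors @ h.t() / temperature        # similarity to every token
    # Mask each anchor's similarity to itself so it cannot be its own positive.
    logits[torch.arange(anchor_idx.numel()), anchor_idx] = float("-inf")
    # The synonym token is the target class; all other tokens act as negatives.
    return F.cross_entropy(logits, positive_idx)


# Toy usage: 12 tokens, 2 synonym pairs (indices are made up for illustration).
hidden = torch.randn(12, 256)
anchor_idx = torch.tensor([1, 4])
positive_idx = torch.tensor([7, 10])
print(token_alignment_loss(hidden, anchor_idx, positive_idx).item())
```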
arXiv Detail & Related papers (2023-04-17T12:23:41Z)

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)

Improving the Cross-Lingual Generalisation in Visual Question Answering [40.86774711775718]
Multilingual vision-language pre-trained models show poor cross-lingual generalisation when applied to non-English data.
In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task.
We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective that augments the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without model modification, and (3) we augment training examples using synthetic code-mixing.
arXiv Detail & Related papers (2022-09-07T08:07:43Z)

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples under limited GPU resources.
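
To illustrate what a queue-based dictionary buys in this setting, here is a minimal MoCo-style PyTorch sketch (not BriVL's code): image queries are contrasted against their matching text keys plus a large FIFO queue of past text keys, so the negative pool is decoupled from the batch size. The class and function names, queue size, and feature dimension are assumptions.

```python
# Hedged MoCo-style sketch for cross-modal contrastive learning with a FIFO
# queue of text negatives; all names and hyper-parameters are illustrative.
import torch
import torch.nn.functional as F


class TextKeyQueue:
    """FIFO dictionary of momentum-encoder text keys used as extra negatives."""

    def __init__(self, dim=256, size=8192):
        self.keys = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, new_keys):
        new_keys = F.normalize(new_keys, dim=-1)
        idx = (self.ptr + torch.arange(new_keys.size(0))) % self.keys.size(0)
        self.keys[idx] = new_keys                 # overwrite the oldest entries
        self.ptr = int(idx[-1] + 1) % self.keys.size(0)


def cross_modal_moco_loss(img_q, txt_k, queue, temperature=0.07):
    """img_q: image queries (B, D); txt_k: matching momentum text keys (B, D)."""
    img_q, txt_k = F.normalize(img_q, dim=-1), F.normalize(txt_k, dim=-1)
    pos = (img_q * txt_k).sum(dim=-1, keepdim=True)   # (B, 1) positive logits
    neg = img_q @ queue.keys.t()                      # (B, K) queued negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(img_q.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, targets)


# Toy usage: the queue lets the negative pool grow far beyond the batch size.
queue = TextKeyQueue()
img_q, txt_k = torch.randn(4, 256), torch.randn(4, 256)
loss = cross_modal_moco_loss(img_q, txt_k, queue)
queue.enqueue(txt_k)
print(loss.item())
```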
arXiv Detail & Related papers (2021-03-11T09:39:49Z)

Cross-lingual Visual Pre-training for Multimodal Machine Translation [36.4592103797139]
We combine cross-lingual and visual pre-training methods to learn cross-lingual representations.
We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance.
arXiv Detail & Related papers (2021-01-25T12:46:41Z)

InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)