Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal
Pre-training
- URL: http://arxiv.org/abs/2206.00621v2
- Date: Mon, 12 Jun 2023 12:47:16 GMT
- Title: Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal
Pre-training
- Authors: Yan Zeng, Wangchunshu Zhou, Ao Luo, Ziming Cheng, Xinsong Zhang
- Abstract summary: We introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training.
Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space.
CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
- Score: 21.017471684853987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce Cross-View Language Modeling, a simple and
effective pre-training framework that unifies cross-lingual and cross-modal
pre-training with shared architectures and objectives. Our approach is
motivated by a key observation that cross-lingual and cross-modal pre-training
share the same goal of aligning two different views of the same object into a
common semantic space. To this end, the cross-view language modeling framework
considers both multi-modal data (i.e., image-caption pairs) and multi-lingual
data (i.e., parallel sentence pairs) as two different views of the same object,
and trains the model to align the two views by maximizing the mutual
information between them with conditional masked language modeling and
contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language
Model, with the cross-view language modeling framework. Empirical results on
IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text
retrieval datasets show that while conceptually simpler, CCLM significantly
outperforms the prior state-of-the-art with an average absolute improvement of
over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained
model that surpasses the translate-test performance of representative English
vision-language models by zero-shot cross-lingual transfer.
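In code terms, the objective described in the abstract can be read as two terms sharing one model: an InfoNCE-style contrastive loss that pulls the two paired views together against in-batch negatives, and a conditional masked language modeling loss in which masked tokens of one view are predicted with the other view as context. The following is a minimal PyTorch sketch of that combination; the interfaces `encode_view` and `fuse_and_predict` and the equal loss weighting are placeholders assumed for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cross_view_losses(model, view_a, view_b, masked_b, labels_b, temperature=0.07):
    """Illustrative cross-view objective: contrastive alignment of two views
    (an image and its caption, or a sentence and its translation) plus
    conditional masked language modeling on view B given view A.
    `model.encode_view` (pooled per-view embeddings) and
    `model.fuse_and_predict` (fusion encoder + MLM head) are assumed
    placeholder interfaces, not the CCLM API.
    """
    # 1) Contrastive term: symmetric InfoNCE with in-batch negatives,
    #    maximizing a lower bound on the mutual information between views.
    z_a = F.normalize(model.encode_view(view_a), dim=-1)   # [B, D]
    z_b = F.normalize(model.encode_view(view_b), dim=-1)   # [B, D]
    logits = z_a @ z_b.t() / temperature                   # [B, B]
    targets = torch.arange(z_a.size(0), device=logits.device)
    loss_contrast = 0.5 * (F.cross_entropy(logits, targets) +
                           F.cross_entropy(logits.t(), targets))

    # 2) Conditional MLM term: predict masked tokens of view B while
    #    attending to view A through the fusion encoder.
    mlm_logits = model.fuse_and_predict(view_a, masked_b)  # [B, T, V]
    loss_mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                               labels_b.reshape(-1), ignore_index=-100)

    # Equal weighting is an assumption; the paper may balance the terms differently.
    return loss_contrast + loss_mlm
```

The same function applies unchanged whether (view_a, view_b) is an image-caption pair or a parallel sentence pair, which is the sense in which cross-modal and cross-lingual pre-training are unified.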
Related papers
- CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual
Knowledge Transfer [23.58317401302547]
We propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language using cross-lingual knowledge transfer.
We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX.
arXiv Detail & Related papers (2023-12-14T14:29:53Z)
- Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval (CCR) has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via machine translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
arXiv Detail & Related papers (2023-09-11T13:44:46Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora [21.78571365050787]
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora.
We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignment between different languages.
Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks.
arXiv Detail & Related papers (2020-12-31T15:52:27Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks, such as translation, and monolingual tasks, such as masked language modeling.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual texts.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
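The InfoXLM-style recipe just described, combining a contrastive task over parallel sentence pairs with standard masked language modeling on monolingual text, can be pictured as a joint training step of the following form. This is a schematic sketch under assumed interfaces (`model.mlm_loss`, `model.encode`) and an unweighted sum of the two losses; it is not the InfoXLM implementation, which involves more than this outline.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_step(model, mono_batch, para_src, para_tgt, temperature=0.05):
    """Schematic joint step over monolingual and parallel corpora:
    masked language modeling on monolingual text plus a cross-lingual
    contrast in which a sentence and its translation form the positive
    pair and the other in-batch sentences act as negatives.
    `model.mlm_loss` and `model.encode` are assumed placeholder interfaces.
    """
    # Monolingual corpus: standard masked language modeling.
    loss_mlm = model.mlm_loss(mono_batch)

    # Parallel corpus: contrast pooled encodings of translation pairs.
    h_src = F.normalize(model.encode(para_src), dim=-1)   # [B, D]
    h_tgt = F.normalize(model.encode(para_tgt), dim=-1)   # [B, D]
    logits = h_src @ h_tgt.t() / temperature               # [B, B] similarities
    targets = torch.arange(h_src.size(0), device=logits.device)
    loss_contrast = F.cross_entropy(logits, targets)

    # Unweighted sum is an assumption; the tasks may be balanced differently.
    return loss_mlm + loss_contrast
```

Minimizing the contrastive term maximizes an InfoNCE lower bound on the mutual information between a sentence and its translation, which is the information-theoretic reading the paper gives to cross-lingual pre-training.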
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.