Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal
Pre-training
- URL: http://arxiv.org/abs/2206.00621v2
- Date: Mon, 12 Jun 2023 12:47:16 GMT
- Title: Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal
Pre-training
- Authors: Yan Zeng, Wangchunshu Zhou, Ao Luo, Ziming Cheng, Xinsong Zhang
- Abstract summary: We introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training.
Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space.
CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
- Score: 21.017471684853987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce Cross-View Language Modeling, a simple and
effective pre-training framework that unifies cross-lingual and cross-modal
pre-training with shared architectures and objectives. Our approach is
motivated by a key observation that cross-lingual and cross-modal pre-training
share the same goal of aligning two different views of the same object into a
common semantic space. To this end, the cross-view language modeling framework
considers both multi-modal data (i.e., image-caption pairs) and multi-lingual
data (i.e., parallel sentence pairs) as two different views of the same object,
and trains the model to align the two views by maximizing the mutual
information between them with conditional masked language modeling and
contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language
Model, with the cross-view language modeling framework. Empirical results on
IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text
retrieval datasets show that while conceptually simpler, CCLM significantly
outperforms the prior state-of-the-art with an average absolute improvement of
over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained
model that surpasses the translate-test performance of representative English
vision-language models by zero-shot cross-lingual transfer.
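In code terms, the objective described in the abstract can be read as two terms sharing one model: an InfoNCE-style contrastive loss that pulls the two paired views together against in-batch negatives, and a conditional masked language modeling loss in which masked tokens of one view are predicted with the other view as context. The following is a minimal PyTorch sketch of that combination; the interfaces `encode_view` and `fuse_and_predict` and the equal loss weighting are placeholders assumed for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cross_view_losses(model, view_a, view_b, masked_b, labels_b, temperature=0.07):
    """Illustrative cross-view objective: contrastive alignment of two views
    (an image and its caption, or a sentence and its translation) plus
    conditional masked language modeling on view B given view A.
    `model.encode_view` (pooled per-view embeddings) and
    `model.fuse_and_predict` (fusion encoder + MLM head) are assumed
    placeholder interfaces, not the CCLM API.
    """
    # 1) Contrastive term: symmetric InfoNCE with in-batch negatives,
    #    maximizing a lower bound on the mutual information between views.
    z_a = F.normalize(model.encode_view(view_a), dim=-1)   # [B, D]
    z_b = F.normalize(model.encode_view(view_b), dim=-1)   # [B, D]
    logits = z_a @ z_b.t() / temperature                   # [B, B]
    targets = torch.arange(z_a.size(0), device=logits.device)
    loss_contrast = 0.5 * (F.cross_entropy(logits, targets) +
                           F.cross_entropy(logits.t(), targets))

    # 2) Conditional MLM term: predict masked tokens of view B while
    #    attending to view A through the fusion encoder.
    mlm_logits = model.fuse_and_predict(view_a, masked_b)  # [B, T, V]
    loss_mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                               labels_b.reshape(-1), ignore_index=-100)

    # Equal weighting is an assumption; the paper may balance the terms differently.
    return loss_contrast + loss_mlm
```

The same function applies unchanged whether (view_a, view_b) is an image-caption pair or a parallel sentence pair, which is the sense in which cross-modal and cross-lingual pre-training are unified.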
Related papers
- CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual
Knowledge Transfer [23.58317401302547]
We propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language using cross-lingual knowledge transfer.
We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX.
arXiv Detail & Related papers (2023-12-14T14:29:53Z)
- Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval (CCR) has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via machine translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
arXiv Detail & Related papers (2023-09-11T13:44:46Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment [24.720485548282845]
We introduce concepts in both modalities to construct two-level semantic representations for language and vision.
We train the cross-modality model in two stages, namely, uni-modal learning and cross-modal learning.
Our model achieves state-of-the-art results on several vision and language tasks.
arXiv Detail & Related papers (2022-01-29T14:30:59Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora [21.78571365050787]
ERNIE-M is a new training method that encourages the model to align the representation of multiple languages with monolingual corpora.
We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignment between different languages.
Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks.
arXiv Detail & Related papers (2020-12-31T15:52:27Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks, such as translation, and monolingual tasks, such as masked language modeling.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual texts.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
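The InfoXLM-style recipe just described, combining a contrastive task over parallel sentence pairs with standard masked language modeling on monolingual text, can be pictured as a joint training step of the following form. This is a schematic sketch under assumed interfaces (`model.mlm_loss`, `model.encode`) and an unweighted sum of the two losses; it is not the InfoXLM implementation, which involves more than this outline.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_step(model, mono_batch, para_src, para_tgt, temperature=0.05):
    """Schematic joint step over monolingual and parallel corpora:
    masked language modeling on monolingual text plus a cross-lingual
    contrast in which a sentence and its translation form the positive
    pair and the other in-batch sentences act as negatives.
    `model.mlm_loss` and `model.encode` are assumed placeholder interfaces.
    """
    # Monolingual corpus: standard masked language modeling.
    loss_mlm = model.mlm_loss(mono_batch)

    # Parallel corpus: contrast pooled encodings of translation pairs.
    h_src = F.normalize(model.encode(para_src), dim=-1)   # [B, D]
    h_tgt = F.normalize(model.encode(para_tgt), dim=-1)   # [B, D]
    logits = h_src @ h_tgt.t() / temperature               # [B, B] similarities
    targets = torch.arange(h_src.size(0), device=logits.device)
    loss_contrast = F.cross_entropy(logits, targets)

    # Unweighted sum is an assumption; the tasks may be balanced differently.
    return loss_mlm + loss_contrast
```

Minimizing the contrastive term maximizes an InfoNCE lower bound on the mutual information between a sentence and its translation, which is the information-theoretic reading the paper gives to cross-lingual pre-training.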
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.