ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for
Understanding and Generation
- URL: http://arxiv.org/abs/2211.04861v1
- Date: Wed, 9 Nov 2022 13:06:58 GMT
- Title: ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for
Understanding and Generation
- Authors: Bin Shan, Yaqian Han, Weichong Yin, Shuohuan Wang, Yu Sun, Hao Tian,
Hua Wu, Haifeng Wang
- Abstract summary: ERNIE-UniX2 is a unified cross-lingual pre-training framework for both generation and understanding tasks.
ERNIE-UniX2 integrates multiple pre-training paradigms based on encoder-decoder architecture.
ERNIE-UniX2 can be seamlessly fine-tuned for a variety of generation and understanding downstream tasks.
- Score: 40.376625939658354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent cross-lingual cross-modal works attempt to extend Vision-Language
Pre-training (VLP) models to non-English inputs and achieve impressive
performance. However, these models focus only on understanding tasks, utilizing an
encoder-only architecture. In this paper, we propose ERNIE-UniX2, a unified
cross-lingual cross-modal pre-training framework for both generation and
understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms
(e.g., contrastive learning and language modeling) based on encoder-decoder
architecture and attempts to learn a better joint representation across
languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned
for a variety of generation and understanding downstream tasks. Pre-trained on
both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA
results on various cross-lingual cross-modal generation and understanding tasks
such as multimodal machine translation and multilingual visual question
answering.
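The abstract describes the framework only at a high level, so the following is a minimal, illustrative sketch of how a cross-modal contrastive objective and an encoder-decoder language-modeling objective can be combined on shared parameters, the two pre-training paradigms the abstract names. It is not the authors' implementation; all module names, dimensions, and the toy data are assumptions introduced for illustration.

```python
# Minimal sketch (not the authors' code): one training step that mixes a
# cross-modal contrastive loss with an encoder-decoder language-modeling loss
# on shared parameters. Module names, dimensions, and data are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnifiedModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.img_proj = nn.Linear(2048, dim)       # stand-in for a visual backbone
        self.lm_head = nn.Linear(dim, vocab_size)

    def encode_text(self, tokens):
        states = self.encoder(self.text_emb(tokens))
        return states, F.normalize(states.mean(dim=1), dim=-1)  # states + pooled view

    def encode_image(self, region_feats):
        states = self.encoder(self.img_proj(region_feats))
        return states, F.normalize(states.mean(dim=1), dim=-1)

def contrastive_loss(txt_vec, img_vec, temperature=0.07):
    """Symmetric in-batch InfoNCE between pooled text and image embeddings."""
    logits = txt_vec @ img_vec.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def lm_loss(model, memory, target_tokens):
    """Teacher-forced generation loss conditioned on encoder states (captioning/translation style)."""
    dec_in = model.text_emb(target_tokens[:, :-1])
    causal = torch.triu(torch.full((dec_in.size(1), dec_in.size(1)), float("-inf")), diagonal=1)
    logits = model.lm_head(model.decoder(dec_in, memory, tgt_mask=causal))
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens[:, 1:].reshape(-1))

# One pre-training step mixing both paradigms on toy random data.
model = ToyUnifiedModel()
captions = torch.randint(0, 1000, (4, 12))     # tokenized captions (any language)
image_feats = torch.randn(4, 36, 2048)         # pre-extracted image region features
img_states, img_vec = model.encode_image(image_feats)
_, txt_vec = model.encode_text(captions)
loss = contrastive_loss(txt_vec, img_vec) + lm_loss(model, img_states, captions)
loss.backward()
```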
Related papers
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training [21.017471684853987]
We introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training.
Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space (a minimal sketch of this shared alignment idea follows after this list).
CLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2022-06-01T16:45:24Z)
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal Encoder-Decoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules, including object and sentence encoders that separately learn the representations of each modality.
arXiv Detail & Related papers (2022-01-11T16:15:07Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively prevents the model from degenerating into predicting masked words conditioned only on context in the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
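As a concrete illustration of the observation highlighted in the Cross-View Language Modeling entry above, the sketch below applies one shared contrastive objective to both (image, caption) and (sentence, translation) pairs, so cross-modal and cross-lingual alignment push embeddings into a single semantic space. It is an illustrative assumption of how such an objective can be written, not the CLM implementation; the helper name and the random stand-in embeddings are hypothetical.

```python
# Illustrative sketch (not the CLM implementation): one InfoNCE objective
# reused for two kinds of "views of the same object" -- (image, caption)
# pairs and (sentence, translation) pairs.
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch contrastive loss between two L2-normalized view embeddings."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Hypothetical pooled embeddings produced by shared encoders (random stand-ins here).
batch, dim = 8, 256
image_emb = torch.randn(batch, dim, requires_grad=True)     # image view
caption_emb = torch.randn(batch, dim, requires_grad=True)   # caption view (any language)
src_sent_emb = torch.randn(batch, dim, requires_grad=True)  # source-language sentence
tgt_sent_emb = torch.randn(batch, dim, requires_grad=True)  # its translation

# The same objective aligns cross-modal and cross-lingual pairs.
loss = info_nce(image_emb, caption_emb) + info_nce(src_sent_emb, tgt_sent_emb)
loss.backward()
```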