UNIMO: Towards Unified-Modal Understanding and Generation via
Cross-Modal Contrastive Learning
- URL: http://arxiv.org/abs/2012.15409v1
- Date: Thu, 31 Dec 2020 02:46:47 GMT
- Title: UNIMO: Towards Unified-Modal Understanding and Generation via
Cross-Modal Contrastive Learning
- Authors: Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua
Wu, Haifeng Wang
- Abstract summary: We propose a unified-modal pre-training architecture, namely UNIMO, which can adapt to both single-modal and multi-modal understanding and generation tasks.
As the non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations.
- Score: 28.89401350391015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing pre-training methods either focus on single-modal tasks or
multi-modal tasks, and cannot effectively adapt to each other. They can only
utilize single-modal data (i.e. text or image) or limited multi-modal data
(i.e. image-text pairs). In this work, we propose a unified-modal pre-training
architecture, namely UNIMO, which can effectively adapt to both single-modal
and multi-modal understanding and generation tasks. Large-scale free text
corpora and image collections can be utilized to improve the capability of
visual and textual understanding, and cross-modal contrastive learning (CMCL)
is leveraged to align the textual and visual information into a unified
semantic space over a corpus of image-text pairs. As the non-paired
single-modal data is very rich, our model can utilize a much larger scale of data
to learn more generalizable representations. Moreover, the textual knowledge
and visual knowledge can enhance each other in the unified semantic space. The
experimental results show that UNIMO significantly improves the performance of
several single-modal and multi-modal downstream tasks.
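
The alignment idea in the abstract, pulling paired image and text representations together in one semantic space while pushing unpaired ones apart, can be illustrated with a generic contrastive (InfoNCE-style) loss. The sketch below is a minimal illustration under stated assumptions, not the UNIMO objective itself: it assumes precomputed embedding vectors, uses the other texts in the batch as negatives, and the function names and the temperature value are hypothetical choices made for this example.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE loss averaged over a batch.

    Assumption: image_embs[i] and text_embs[i] form a matched pair, and
    every other text in the batch acts as a negative for image i.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, image in enumerate(images):
        logits = [dot(image, text) / temperature for text in texts]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of the matched text under a softmax.
        total += log_denominator - logits[i]
    return total / len(images)


# Toy embeddings (illustrative values only, not from the paper).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [toy_texts[1], toy_texts[0]]

loss_aligned = info_nce(toy_images, toy_texts)
loss_swapped = info_nce(toy_images, swapped_texts)
print(loss_aligned < loss_swapped)
```

The loss is small when each image is most similar to its own text and grows when the pairing is broken, which is the signal that drives the two modalities toward a shared space. A symmetric variant would add the text-to-image direction in the same way.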
Related papers
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
- TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529]
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations.
Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views.
TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
arXiv Detail & Related papers (2021-06-03T12:36:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.