UNIMO: Towards Unified-Modal Understanding and Generation via
Cross-Modal Contrastive Learning
- URL: http://arxiv.org/abs/2012.15409v1
- Date: Thu, 31 Dec 2020 02:46:47 GMT
- Title: UNIMO: Towards Unified-Modal Understanding and Generation via
Cross-Modal Contrastive Learning
- Authors: Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua
Wu, Haifeng Wang
- Abstract summary: We propose a unified-modal pre-training architecture, namely UNIMO, which can adapt to both single-modal and multi-modal understanding and generation tasks.
Because non-paired single-modal data are abundant, our model can utilize a much larger scale of data to learn more generalizable representations.
- Score: 28.89401350391015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing pre-training methods focus on either single-modal tasks or multi-modal tasks, and cannot effectively adapt to the other setting. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. Because non-paired single-modal data are abundant, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks.
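The abstract does not spell out the exact CMCL objective, but the core idea of aligning image and text representations in a shared semantic space can be illustrated with a standard InfoNCE-style contrastive loss over a batch of image-text pairs. The sketch below is a generic illustration rather than UNIMO's implementation; the embedding dimension, temperature, and dummy encoder outputs are placeholder assumptions.

```python
# Minimal sketch of an InfoNCE-style cross-modal contrastive loss
# (illustrative only; not UNIMO's exact CMCL objective).
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: [batch, dim] embeddings of paired images and texts.
    Matching (i, i) entries are positives; all other pairs in the batch are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # [batch, batch] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy embeddings standing in for visual/textual encoder outputs:
img = torch.randn(8, 256)
txt = torch.randn(8, 256)
loss = cross_modal_contrastive_loss(img, txt)
```

Minimizing such a loss pulls paired image and text embeddings together and pushes non-paired ones apart, which is the alignment effect the abstract attributes to CMCL.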
Related papers
- Everything is a Video: Unifying Modalities through Next-Frame Prediction [5.720266474212221]
We introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning.
We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components.
Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text.
arXiv Detail & Related papers (2024-11-15T12:59:37Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new in-context learning (ICL) framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompts into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods for employing visual information as assistive signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures (a generic sketch of the feature-space augmentation idea appears after this list).
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2.
We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts.
Our code and models are publicly available on the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
- TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529]
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations.
Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views.
TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
arXiv Detail & Related papers (2021-06-03T12:36:01Z)
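The LeMDA entry above describes augmenting multimodal data in feature space rather than in the raw input space. As a rough illustration of that general idea (not LeMDA's actual training procedure, which learns the augmentation network jointly with the task model), the sketch below applies a small learned residual perturbation to concatenated latent features before a fusion head; the network shapes and feature dimensions are invented for the example.

```python
# Rough sketch of feature-space augmentation for multimodal latents
# (illustrative of the general idea only; not the LeMDA algorithm).
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    """Learns a perturbation that is applied to fused latent features."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Additive residual perturbation keeps augmented features close to the originals.
        return features + self.net(features)

# Stand-in latent features from separate image and text encoders.
image_feat = torch.randn(8, 256)
text_feat = torch.randn(8, 256)
fused = torch.cat([image_feat, text_feat], dim=-1)   # [8, 512]

augmenter = FeatureAugmenter(dim=512)
augmented = augmenter(fused)   # additional training views created in feature space
```

Working in feature space lets one augmentation module serve all modalities at once, instead of hand-designing separate pixel- and token-level augmentations.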