Related papers: UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

URL: http://arxiv.org/abs/2012.15409v1
Date: Thu, 31 Dec 2020 02:46:47 GMT
Title: UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Authors: Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
Abstract summary: We propose a unified-modal pre-training architecture, namely UNIMO, which can adapt to both single-modal and multi-modal understanding and generation tasks. As the non-paired single-modal data is very rich, our model can utilize much larger scale of data to learn more generalizable representations.
Score: 28.89401350391015
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existed pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large scale of free text corpus and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As the non-paired single-modal data is very rich, our model can utilize much larger scale of data to learn more generalizable representations. Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks.

Related papers

True Multimodal In-Context Learning Needs Attention to the Visual Context [69.63677595066012]
Multimodal Large Language Models (MLLMs) have enabled Multimodal In-Context Learning (MICL)-adapting to new tasks.<n>Current MLLMs tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation.<n>We introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context.
arXiv Detail & Related papers (2025-07-21T17:08:18Z)
Unified Multimodal Understanding via Byte-Pair Visual Encoding [34.96534298857146]
Multimodal large language models (MLLMs) have made significant progress in vision-language understanding.<n>We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens.
arXiv Detail & Related papers (2025-06-30T09:08:08Z)
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models [79.0546136194314]
We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models.<n>We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
arXiv Detail & Related papers (2024-11-15T20:09:59Z)
Everything is a Video: Unifying Modalities through Next-Frame Prediction [5.720266474212221]
We introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text.
arXiv Detail & Related papers (2024-11-15T12:59:37Z)
NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations. One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks. We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs. Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space. We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously. We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2. We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts. Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529]
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations. Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views. TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
arXiv Detail & Related papers (2021-06-03T12:36:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.