Unified Multimodal Pre-training and Prompt-based Tuning for
Vision-Language Understanding and Generation
- URL: http://arxiv.org/abs/2112.05587v1
- Date: Fri, 10 Dec 2021 14:59:06 GMT
- Title: Unified Multimodal Pre-training and Prompt-based Tuning for
Vision-Language Understanding and Generation
- Authors: Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang
- Abstract summary: We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks when using the same model.
- Score: 86.26522210882699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing vision-language pre-training methods focus on understanding
tasks and use BERT-like objectives (masked language modeling and image-text
matching) during pretraining. Although they perform well in many understanding
downstream tasks, e.g., visual question answering, image-text retrieval and
visual entailment, they do not possess the ability to generate. To tackle this
problem, we propose Unified multimodal pre-training for both Vision-Language
understanding and generation (UniVL). The proposed UniVL is capable of handling
both understanding tasks and generative tasks. We augment existing pretraining
paradigms that only use random masks with causal masks, i.e., triangular masks
that mask out future tokens, such that the pre-trained models can have
autoregressive generation abilities by design. We formulate several previous
understanding tasks as text generation and propose a prompt-based method for
fine-tuning on different downstream tasks. Our experiments show that
there is a trade-off between understanding and generation tasks when
using the same model, and a feasible way to improve both is to use more
data. Our UniVL framework attains comparable performance to recent
vision-language pre-training methods on both understanding tasks and generation
tasks. Moreover, we demonstrate that prompt-based fine-tuning is more
data-efficient - it outperforms discriminative methods in few-shot scenarios.
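To make the two masking regimes concrete, here is a minimal sketch (not the authors' released code): a BERT-style random token mask for the understanding objectives and a triangular causal attention mask for autoregressive generation, plus a hypothetical prompt that casts VQA as text generation. The mask token id, masking probability, and prompt template are illustrative assumptions.
```python
# Minimal sketch of the two masking schemes described in the abstract.
# Assumptions: PyTorch, a BERT-style [MASK] id of 103, 15% masking, and an
# illustrative VQA prompt template (not the templates used by UniVL).
import torch


def random_token_mask(token_ids: torch.Tensor, mask_prob: float = 0.15,
                      mask_id: int = 103):
    """BERT-style masked-LM corruption: hide a random subset of tokens;
    the model is trained to recover the originals at the masked positions."""
    mask = torch.rand(token_ids.shape) < mask_prob
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id
    return corrupted, mask


def causal_attention_mask(seq_len: int) -> torch.Tensor:
    """Triangular (causal) mask: position i may attend only to positions <= i,
    which gives the pre-trained model autoregressive generation by design."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


# Hypothetical prompt-based reformulation of an understanding task (VQA)
# as text generation: the answer is decoded token by token under the causal mask.
prompt = "question: what is the man holding? answer:"
```
The same backbone can then be fine-tuned either discriminatively or, as the abstract argues, generatively through such prompts, which is where the reported few-shot data efficiency comes from.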
Related papers
- ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt [67.8934749027315]
We propose a unified framework for graph hybrid pre-training which injects task identification and position identification into GNNs.
We also propose a novel pre-training paradigm based on a group of $k$-nearest neighbors.
arXiv Detail & Related papers (2023-10-23T12:11:13Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - TransPrompt v2: A Transferable Prompting Framework for Cross-task Text
Classification [37.824031151922604]
We propose TransPrompt v2, a novel transferable prompting framework for few-shot learning across similar or distant text classification tasks.
For learning across similar tasks, we employ a multi-task meta-knowledge acquisition (MMA) procedure to train a meta-learner.
For learning across distant tasks, we inject the task type descriptions into the prompt, and capture the intra-type and inter-type prompt embeddings.
arXiv Detail & Related papers (2023-08-29T04:16:57Z) - Seeing What You Miss: Vision-Language Pre-training with Semantic
Completion Learning [22.464424641734652]
Cross-modal alignment is essential for vision-language pre-training models.
We propose a novel Semantic Completion Learning task to facilitate global-to-local alignment.
We also present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously.
arXiv Detail & Related papers (2022-11-24T06:39:16Z) - Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z) - VLM: Task-agnostic Video-Language Model Pre-training for Video
Understanding [78.28397557433544]
We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
arXiv Detail & Related papers (2021-05-20T19:13:27Z) - Hierarchical Multitask Learning Approach for BERT [0.36525095710982913]
BERT learns embeddings by solving two tasks: masked language modeling (masked LM) and next sentence prediction (NSP).
We adopt hierarchical multitask learning approaches for BERT pre-training.
Our results show that imposing a task hierarchy in pre-training improves the performance of embeddings.
arXiv Detail & Related papers (2020-10-17T09:23:04Z) - Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)