PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models
- URL: http://arxiv.org/abs/2205.11169v1
- Date: Mon, 23 May 2022 10:17:53 GMT
- Title: PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models
- Authors: Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua,
Maosong Sun
- Abstract summary: We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
- Score: 127.17675443137064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training (VLP) has shown impressive performance on a wide
range of cross-modal tasks, where VLP models without reliance on object
detectors are becoming the mainstream due to their superior computation
efficiency and competitive performance. However, the removal of object
detectors also deprives the capability of VLP models in explicit object
modeling, which is essential to various position-sensitive vision-language (VL)
tasks, such as referring expression comprehension and visual commonsense
reasoning. To address the challenge, we introduce PEVL that enhances the
pre-training and prompt tuning of VLP models with explicit object position
modeling. Specifically, PEVL reformulates discretized object positions and
language in a unified language modeling framework, which facilitates explicit
VL alignment during pre-training, and also enables flexible prompt tuning for
various downstream tasks. We show that PEVL enables state-of-the-art
performance of detector-free VLP models on position-sensitive tasks such as
referring expression comprehension and phrase grounding, and also improves the
performance on position-insensitive tasks with grounded inputs. We make the
data and code for this paper publicly available at
https://github.com/thunlp/PEVL.
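As a rough illustration of the idea described in the abstract, the sketch below shows one way object positions could be discretized into text-like tokens and placed alongside language in a single sequence. The bin count, token format, and function names here are illustrative assumptions for clarity, not PEVL's actual vocabulary or implementation.
```python
# Minimal sketch: express bounding-box coordinates as discrete tokens so that
# positions and language can share one language-modeling framework.
# Assumptions: 512 coordinate bins and "<pos_N>" token strings (illustrative only).

def discretize_box(box, image_w, image_h, num_bins=512):
    """Map continuous box coordinates (x1, y1, x2, y2) to discrete bin indices."""
    x1, y1, x2, y2 = box
    scales = (image_w, image_h, image_w, image_h)
    return [min(num_bins - 1, int(c / s * num_bins))
            for c, s in zip((x1, y1, x2, y2), scales)]

def build_position_enhanced_text(phrase, box, image_w, image_h):
    """Append discretized position tokens to an object mention, yielding a
    single token sequence that a language model can process end to end."""
    bins = discretize_box(box, image_w, image_h)
    pos_tokens = " ".join(f"<pos_{b}>" for b in bins)
    return f"{phrase} {pos_tokens}"

# Example: a referring expression and its region as one sequence.
print(build_position_enhanced_text("the dog on the left", (32, 80, 210, 400), 640, 480))
# -> "the dog on the left <pos_25> <pos_85> <pos_168> <pos_426>"
```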
Related papers
- Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation [34.37450315995176]
Current Referring Video Object Segmentation (RVOS) methods typically use vision and language models pretrained independently as backbones.
We propose a temporal-aware prompt-tuning method, which adapts pretrained representations for pixel-level prediction.
Our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
arXiv Detail & Related papers (2024-05-17T08:14:22Z) - Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm.
We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information.
We introduce a novel approach, wherein visual prompts are memorized with the weights of the FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z) - MetaVL: Transferring In-Context Learning Ability From Language Models to
Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z) - Adapting Pre-trained Language Models to Vision-Language Tasks via
Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z) - Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates visual grounding as a fill-in-the-blank problem, encouraging the model to predict the objects in given blocks or to regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards the object detector at inference time while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z) - GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.