Position-guided Text Prompt for Vision-Language Pre-training
- URL: http://arxiv.org/abs/2212.09737v2
- Date: Wed, 7 Jun 2023 06:28:18 GMT
- Title: Position-guided Text Prompt for Vision-Language Pre-training
- Authors: Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan
- Abstract summary: We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods, with much faster inference speed since PTP discards its object detector at inference while the latter cannot.
- Score: 121.15494549650548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pre-Training (VLP) has shown promising capabilities to align
image and text pairs, facilitating a broad variety of cross-modal learning
tasks. However, we observe that VLP models often lack the visual
grounding/localization capability which is critical for many downstream tasks
such as visual reasoning. In this work, we propose a novel Position-guided Text
Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal
models trained with VLP. Specifically, in the VLP phase, PTP divides the image
into $N\times N$ blocks, and identifies the objects in each block through the
widely used object detector in VLP. It then reformulates the visual grounding
task into a fill-in-the-blank problem given a PTP by encouraging the model to
predict the objects in the given blocks or regress the blocks of a given
object, e.g. filling ``P'' or ``O'' in a PTP ``The block P has a O''. This mechanism
improves the visual grounding capability of VLP models and thus helps them
better handle various downstream tasks. By introducing PTP into several
state-of-the-art VLP frameworks, we observe consistently significant
improvements across representative cross-modal learning model architectures and
several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average
recall@1) for ViLT \cite{vilt} baseline, and COCO Captioning (+5.3 in CIDEr)
for SOTA BLIP \cite{blip} baseline. Moreover, PTP achieves results comparable to
object-detector-based methods with much faster inference speed, since PTP
discards its object detector at inference while the latter cannot. Our code and
pre-trained weights will be released at \url{https://github.com/sail-sg/ptp}.
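The prompt construction described in the abstract (divide the image into $N\times N$ blocks, assign each detected object to a block, and phrase the pair as a fill-in-the-blank sentence) is simple enough to sketch. Below is a minimal, hypothetical Python sketch under those assumptions; the function names, row-major block numbering, and default N=3 are illustrative and are not taken from the released code.

```python
# Sketch of Position-guided Text Prompt (PTP) construction as described in the
# abstract: split the image into N x N blocks, assign each detected object to
# the block containing its box center, and emit prompts "The block P has a O".
# All names and conventions here are illustrative assumptions, not the
# authors' released implementation.
from typing import List, Tuple

Detection = Tuple[str, Tuple[float, float, float, float]]  # (label, (x0, y0, x1, y1))


def block_index(box, img_w: float, img_h: float, n: int) -> int:
    """Map a bounding box to the index of the N x N block holding its center."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / img_w * n), n - 1)
    row = min(int(cy / img_h * n), n - 1)
    return row * n + col  # blocks numbered 0 .. n*n - 1, row-major


def build_ptp_prompts(dets: List[Detection], img_w: float, img_h: float,
                      n: int = 3) -> List[str]:
    """Turn detector output into one PTP sentence per object."""
    prompts = []
    for label, box in dets:
        p = block_index(box, img_w, img_h, n)
        prompts.append(f"The block {p} has a {label}.")
    return prompts


if __name__ == "__main__":
    dets = [("dog", (10, 120, 180, 300)), ("ball", (400, 350, 460, 410))]
    for s in build_ptp_prompts(dets, img_w=640, img_h=480, n=3):
        print(s)  # one prompt per detected object
    # During pre-training, either the block index "P" or the object word "O"
    # would be blanked out so the model learns to fill it in.
```

In this reading, the fill-in-the-blank objective comes from masking either the block index or the object word in such sentences during pre-training, which is what ties object identity to image position without keeping the detector at inference time.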
Related papers
- Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models [32.83187649097727]
We build a dataset of over four million multi-view image-text pairs across more than 100K objects.
We design a novel fine-tuning framework named Omniview-Tuning (OVT).
OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy.
arXiv Detail & Related papers (2024-04-18T12:41:33Z) - Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z) - Exploiting the Textual Potential from Vision-Language Pre-training for
Text-based Person Search [17.360982091304137]
Text-based Person Search (TPS) aims to retrieve pedestrians that match text descriptions instead of query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
arXiv Detail & Related papers (2023-03-08T10:41:22Z) - Probing Cross-modal Semantics Alignment Capability from the Textual
Perspective [52.52870614418373]
Aligning cross-modal semantics is claimed to be one of the essential capabilities of vision and language pre-training models.
We propose a new probing method based on image captioning to first empirically study the cross-modal semantics alignment of these models.
arXiv Detail & Related papers (2022-10-18T02:55:58Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - VL-CheckList: Evaluating Pre-trained Vision-Language Models with
Objects, Attributes and Relations [28.322824790738768]
Vision-Language Pretraining models have successfully facilitated many cross-modal downstream tasks.
Most existing works evaluated their systems by comparing the fine-tuned downstream task performance.
Inspired by the CheckList for testing natural language processing, we exploit VL-CheckList, a novel framework.
arXiv Detail & Related papers (2022-07-01T06:25:53Z) - PEVL: Position-enhanced Pre-training and Prompt Tuning for
Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework; a rough sketch of this idea follows the related papers list below.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Applied to the recent transformer-based image recognition model ViT, the approach shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
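The PEVL entry above describes folding discretized object positions into the same sequence as the text. As a rough, hedged illustration only (the bin count of 512, the `<pos_k>` token format, and the helper names below are assumptions, not PEVL's actual tokenizer), box coordinates can be bucketed into special tokens and appended to the object phrase:

```python
# Hypothetical sketch of position tokens in a unified language-modeling
# sequence: discretize each box coordinate into a fixed number of bins and
# splice the resulting tokens after the object phrase. Illustrative only.

def coord_to_token(value: float, max_value: float, num_bins: int = 512) -> str:
    """Map a continuous coordinate to a discrete position token like <pos_137>."""
    bin_id = min(int(value / max_value * num_bins), num_bins - 1)
    return f"<pos_{bin_id}>"


def attach_position_tokens(phrase: str, box, img_w: float, img_h: float) -> str:
    """Append discretized (x0, y0, x1, y1) tokens to an object phrase."""
    x0, y0, x1, y1 = box
    tokens = [
        coord_to_token(x0, img_w), coord_to_token(y0, img_h),
        coord_to_token(x1, img_w), coord_to_token(y1, img_h),
    ]
    return f"{phrase} {' '.join(tokens)}"


if __name__ == "__main__":
    print(attach_position_tokens("a brown dog", (10, 120, 180, 300), 640, 480))
    # prints: a brown dog <pos_8> <pos_128> <pos_144> <pos_320>
```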
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.