GLIPv2: Unifying Localization and Vision-Language Understanding
- URL: http://arxiv.org/abs/2206.05836v1
- Date: Sun, 12 Jun 2022 20:31:28 GMT
- Title: GLIPv2: Unifying Localization and Vision-Language Understanding
- Authors: Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian
Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao
- Abstract summary: We present GLIPv2, a grounded VL understanding model that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
- Score: 161.1770269829139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present GLIPv2, a grounded VL understanding model that serves
both localization tasks (e.g., object detection, instance segmentation) and
Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2
elegantly unifies localization pre-training and Vision-Language Pre-training
(VLP) with three pre-training tasks: phrase grounding as a VL reformulation of
the detection task, region-word contrastive learning as a novel region-word
level contrastive learning task, and masked language modeling. This
unification not only simplifies the previous multi-stage VLP procedure but also
achieves mutual benefits between localization and understanding tasks.
Experimental results show that a single GLIPv2 model (all model weights are
shared) achieves near SoTA performance on various localization and
understanding tasks. The model also shows (1) strong zero-shot and few-shot
adaptation performance on open-vocabulary object detection tasks and (2) superior
grounding capability on VL understanding tasks. Code will be released at
https://github.com/microsoft/GLIP.
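To make the grounding-as-detection idea from the abstract concrete, the sketch below computes region-word alignment scores as dot products between region and word features and uses them both as grounding logits and in a simplified region-word contrastive loss. This is a minimal, intra-prompt illustration of the idea described above, not the paper's implementation (GLIPv2's contrastive task also draws negatives across images); the tensor shapes, projection-free features, and temperature are assumptions.
```python
# Minimal sketch of region-word alignment and a region-word contrastive loss,
# illustrating the idea described in the abstract (not the official GLIPv2 code).
# Shapes and the temperature value below are illustrative assumptions.
import torch
import torch.nn.functional as F

def region_word_alignment(region_feats, word_feats, temperature=0.07):
    """region_feats: (num_regions, dim) visual features of candidate regions.
    word_feats:   (num_words, dim) token features of the text prompt.
    Returns an alignment matrix usable as grounding/detection logits."""
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)
    return region_feats @ word_feats.t() / temperature  # (num_regions, num_words)

def region_word_contrastive_loss(alignment, positive_map):
    """positive_map: (num_regions, num_words) binary matrix marking which words
    ground which regions (e.g., the words of the phrase matched to a box)."""
    # For each region, push probability mass onto the words that describe it
    # and away from all other words in the prompt.
    target = positive_map / positive_map.sum(dim=-1, keepdim=True).clamp(min=1)
    log_probs = F.log_softmax(alignment, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Usage with random features: 100 candidate regions, a 32-token prompt.
regions = torch.randn(100, 256)
words = torch.randn(32, 256)
pos = torch.zeros(100, 32)
pos[0, 3:6] = 1  # pretend region 0 is grounded by words 3-5
align = region_word_alignment(regions, words)
loss = region_word_contrastive_loss(align, pos)
```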
Related papers
- Bridging Environments and Language with Rendering Functions and Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.
This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs).
We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
- VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model [9.122593534510512]
We introduce Vision-Language Model assisted Pseudo-Labeling (VLM-PL).
This technique uses a Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training.
VLM-PL integrates refined pseudo GTs and real GTs in subsequent training, effectively combining new and old knowledge.
arXiv Detail & Related papers (2024-03-08T14:23:00Z)
- Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment [15.180715595425864]
We introduce a novel method, DuAl-PT, to improve the prompt learning of vision-language models by incorporating pre-trained large language models (LLMs).
With DuAl-PT, we propose to learn more context-aware prompts, benefiting from both explicit and implicit context modeling.
Empirically, DuAl-PT achieves superior performance on 11 downstream datasets for few-shot recognition and base-to-new generalization.
arXiv Detail & Related papers (2023-09-08T06:51:15Z)
- MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [74.89629463600978]
In the vision-language domain, most large-scale pre-trained vision-language models do not possess the ability to conduct in-context learning.
In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the vision domain?
arXiv Detail & Related papers (2023-06-02T07:21:03Z)
- GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions [92.96783800362886]
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks.
We introduce GRILL, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances.
arXiv Detail & Related papers (2023-05-24T03:33:21Z)
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone [170.85076677740292]
We present FIBER (Fusion-In-the-Backbone-based transformER), a new model architecture for vision-language (VL) pre-training.
Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model.
We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection.
arXiv Detail & Related papers (2022-06-15T16:41:29Z)
- PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models [127.17675443137064]
We introduce PEVL, which enhances the pre-training and prompt tuning of vision-language models with explicit object position modeling.
PEVL reformulates discretized object positions and language in a unified language modeling framework.
We show that PEVL enables state-of-the-art performance on position-sensitive tasks such as referring expression comprehension and phrase grounding.
arXiv Detail & Related papers (2022-05-23T10:17:53Z)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation [86.4572981982407]
We propose BLIP, a new vision-language framework that transfers flexibly to both vision-language understanding and generation tasks.
BLIP effectively utilizes noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones (see the sketch after this list).
BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner.
arXiv Detail & Related papers (2022-01-28T12:49:48Z)
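As a concrete illustration of the caption bootstrapping described in the BLIP entry above, the sketch below cleans web image-text pairs with a captioner and a filter model. It is a minimal sketch under assumed interfaces; the captioner/filter callables and the threshold value are hypothetical stand-ins, not the actual BLIP implementation.
```python
# Minimal sketch of caption bootstrapping ("captioner + filter") as described in
# the BLIP entry above. The captioner/filter interfaces and the threshold are
# hypothetical stand-ins, not BLIP's actual components.
from typing import Callable, Iterable, List, Tuple

KEEP_THRESHOLD = 0.5  # assumed cutoff on the filter's image-text match score

def bootstrap_captions(
    web_pairs: Iterable[Tuple[str, str]],        # (image_path, noisy web caption)
    captioner: Callable[[str], str],             # image_path -> synthetic caption
    filter_score: Callable[[str, str], float],   # (image_path, caption) -> match score
) -> List[Tuple[str, str]]:
    """Return a cleaned set of image-caption pairs for further pre-training."""
    cleaned = []
    for image_path, web_caption in web_pairs:
        synthetic = captioner(image_path)  # generate a synthetic caption
        # Keep whichever captions the filter judges to match the image.
        for caption in (web_caption, synthetic):
            if filter_score(image_path, caption) >= KEEP_THRESHOLD:
                cleaned.append((image_path, caption))
    return cleaned
```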