GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
Pre-training and Open-Vocabulary Object Detection
- URL: http://arxiv.org/abs/2312.15043v1
- Date: Fri, 22 Dec 2023 20:14:55 GMT
- Title: GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language
Pre-training and Open-Vocabulary Object Detection
- Authors: Haozhan Shen, Tiancheng Zhao, Mingwei Zhu, Jianwei Yin
- Abstract summary: We propose a zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data.
We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
- Score: 24.48128633414131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding, a crucial vision-language task that involves
understanding the visual context referred to by a query expression, requires
the model to capture interactions between objects as well as various spatial
and attribute information. However, annotation data for visual grounding is
limited because its annotation process is time-consuming and labor-intensive,
which constrains trained models from generalizing to broader domains. To
address this challenge, we propose
GroundVLP, a simple yet effective zero-shot method that harnesses visual
grounding ability from existing models trained on image-text pairs and pure
object detection data, both of which are more readily obtainable and cover a
broader domain than visual grounding annotation data. GroundVLP
proposes a fusion mechanism that combines the heatmap from GradCAM and the
object proposals of open-vocabulary detectors. We demonstrate that the proposed
method significantly outperforms other zero-shot methods on RefCOCO/+/g
datasets, surpassing prior zero-shot state-of-the-art by approximately 28\% on
the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs
comparably to or even better than some non-VLP-based supervised models on the
Flickr30k entities dataset. Our code is available at
https://github.com/om-ai-lab/GroundVLP.
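The abstract describes the fusion only at a high level; the full implementation lives in the linked repository. As a rough illustration, here is a minimal sketch, assuming the GradCAM heatmap for the query expression has already been computed and resized to the image resolution, and that the open-vocabulary detector has returned proposals for the query's category word: each box is scored by the heatmap mass it encloses, softly normalized by its area, and the highest-scoring box is returned. The function name select_box and the alpha knob are illustrative, not taken from the GroundVLP codebase.

```python
# Minimal sketch of fusing a GradCAM heatmap with detector proposals.
# Illustrative only; see https://github.com/om-ai-lab/GroundVLP for the
# authors' actual fusion and scoring details.
import numpy as np

def select_box(heatmap: np.ndarray, boxes: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Pick the proposal best supported by the query heatmap.

    heatmap: (H, W) non-negative GradCAM activation for the query,
             resized to the image resolution.
    boxes:   (N, 4) detector proposals as [x1, y1, x2, y2] in pixels.
    alpha:   hypothetical area-normalization exponent (0 = raw sum,
             1 = mean activation), not a GroundVLP hyper-parameter.
    """
    scores = []
    for x1, y1, x2, y2 in boxes.astype(int):
        region = heatmap[y1:y2, x1:x2]
        area = max(region.size, 1)
        # Heatmap mass inside the box, softly normalized by box area so
        # that very large boxes do not win by default.
        scores.append(region.sum() / (area ** alpha))
    return boxes[int(np.argmax(scores))]
```

The sketch covers only the final box-selection step; in GroundVLP the heatmap comes from applying GradCAM to a vision-language pre-trained model conditioned on the query, and the proposals come from an open-vocabulary detector, as stated in the abstract.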
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders [31.371338262371122]
VGDiffZero is a zero-shot visual grounding framework based on text-to-image diffusion models.
We show that VGDiffZero achieves strong performance on zero-shot visual grounding.
arXiv Detail & Related papers (2023-09-03T11:32:28Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction (a generic sketch of this kind of loss appears after this list).
Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
- Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
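The CLIP-Count entry above mentions a patch-text contrastive loss for aligning a text embedding with dense patch features. Its exact formulation is not reproduced in the summary, so the following is only a generic, multi-positive InfoNCE-style sketch under assumed inputs: patch embeddings, a single text embedding, and a boolean mask marking patches assumed to cover the object of interest. The function name, the mask convention, and the temperature default are illustrative rather than taken from CLIP-Count.

```python
# Generic multi-positive InfoNCE-style patch-text loss (a sketch, not
# necessarily the formulation used in CLIP-Count).
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_feats: torch.Tensor,
                                text_feat: torch.Tensor,
                                fg_mask: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """patch_feats: (N, D) patch embeddings; text_feat: (D,) text embedding;
    fg_mask: (N,) bool, True for patches assumed to contain the object
    (must contain at least one True)."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = patch_feats @ text_feat / temperature        # (N,) scaled similarities
    log_probs = sims - torch.logsumexp(sims, dim=0)     # log-softmax over patches
    # Maximize the probability mass the text assigns to foreground patches.
    return -torch.logsumexp(log_probs[fg_mask], dim=0)
```

The objective pulls foreground patches toward the text embedding relative to all other patches in the image; CLIP-Count's actual loss may define positives, negatives, and normalization differently.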