Contrastive Region Guidance: Improving Grounding in Vision-Language
Models without Training
- URL: http://arxiv.org/abs/2403.02325v1
- Date: Mon, 4 Mar 2024 18:55:30 GMT
- Title: Contrastive Region Guidance: Improving Grounding in Vision-Language
Models without Training
- Authors: David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
- Abstract summary: We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
- Score: 79.27663870280038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Highlighting particularly relevant regions of an image can improve the
performance of vision-language models (VLMs) on various vision-language (VL)
tasks by guiding the model to attend more closely to these regions of interest.
For example, VLMs can be given a "visual prompt", where visual markers such as
bounding boxes delineate key image regions. However, current VLMs that can
incorporate visual guidance are either proprietary and expensive or require
costly training on curated data that includes visual prompts. We introduce
Contrastive Region Guidance (CRG), a training-free guidance method that enables
open-source VLMs to respond to visual prompts. CRG contrasts model outputs
produced with and without visual prompts, factoring out biases revealed by the
model when answering without the information required to produce a correct
answer (i.e., the model's prior). CRG achieves substantial improvements in a
wide variety of VL tasks: When region annotations are provided, CRG increases
absolute accuracy by up to 11.1% on ViP-Bench, a collection of six diverse
region-based tasks such as recognition, math, and object relationship
reasoning. We also show CRG's applicability to spatial reasoning, with 10%
improvement on What'sUp, as well as to compositional generalization --
improving accuracy by 11.5% and 7.5% on two challenging splits from SugarCrepe
-- and to image-text alignment for generated images, where we improve by up to
8.4 AUROC and 6.8 F1 points on SeeTRUE. When reference regions are absent, CRG
allows us to re-rank proposed regions in referring expression comprehension and
phrase grounding benchmarks like RefCOCO/+/g and Flickr30K Entities, with an
average gain of 3.2% in accuracy. Our analysis explores alternative masking
strategies for CRG, quantifies CRG's probability shift, and evaluates the role
of region guidance strength, empirically validating CRG's design choices.
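The abstract describes contrasting the model's outputs produced with and without the visual prompt to factor out the model's prior. A minimal sketch of how such a logit-level contrast could be applied at decoding time is given below; the function name, the blacked-out-region masking, and the classifier-free-guidance-style formula with strength alpha are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def contrastive_region_guidance(logits_with_region: torch.Tensor,
                                logits_without_region: torch.Tensor,
                                alpha: float = 1.0) -> torch.Tensor:
    """Combine next-token logits from two forward passes of the same VLM.

    logits_with_region    : logits conditioned on the image with the prompted region visible
    logits_without_region : logits conditioned on the image with that region blacked out
    alpha                 : guidance strength (alpha = 0 recovers the unguided model)

    The contrast amplifies evidence that depends on the highlighted region and
    factors out the model's region-independent prior. This is an assumed
    classifier-free-guidance-style form; the paper's exact formula may differ.
    """
    return (1 + alpha) * logits_with_region - alpha * logits_without_region

# Toy usage with dummy next-token logits over a 5-token vocabulary.
with_region = torch.tensor([2.0, 0.5, 0.1, -1.0, 0.0])
without_region = torch.tensor([1.8, 0.6, 0.2, -0.9, 0.1])
guided = contrastive_region_guidance(with_region, without_region, alpha=1.0)
next_token = torch.softmax(guided, dim=-1).argmax()
```

In this reading, the two forward passes share the same text prompt and differ only in whether the referenced region is visible, and the guided logits would replace the original ones at each decoding step, which is what makes the method training-free.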
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (RGPT for short) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied and significantly enhances performance across a range of region-level tasks.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions [24.596929878045568]
We develop methods to train vision-language models (VLMs) with "bag-level" image-text supervision.
We use descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets.
Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance.
arXiv Detail & Related papers (2024-01-04T08:39:13Z)
- Fine-Grained Visual Prompting [35.032567257651515]
Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions.
It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
arXiv Detail & Related papers (2023-06-07T11:39:56Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a Prior-Guided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.