Fine-Grained Visual Prompting
- URL: http://arxiv.org/abs/2306.04356v2
- Date: Tue, 12 Dec 2023 06:36:53 GMT
- Title: Fine-Grained Visual Prompting
- Authors: Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
- Abstract summary: Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions.
It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
- Score: 35.032567257651515
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive
zero-shot transfer capabilities in image-level visual perception. However,
these models have shown limited performance in instance-level tasks that demand
precise localization and recognition. Previous works have suggested that
incorporating visual prompts, such as colorful boxes or circles, can improve
the ability of models to recognize objects of interest. Nonetheless, compared
to language prompting, visual prompting designs are rarely explored. Existing
approaches, which employ coarse visual cues such as colorful boxes or circles,
often result in sub-optimal performance due to the inclusion of irrelevant and
noisy pixels. In this paper, we carefully study the visual prompting designs by
exploring more fine-grained markings, such as segmentation masks and their
variations. In addition, we introduce a new zero-shot framework that leverages
pixel-level annotations acquired from a generalist segmentation model for
fine-grained visual prompting. Consequently, our investigation reveals that a
straightforward application of blur outside the target mask, referred to as the
Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting
strategy leverages the precise mask annotations to reduce focus on weakly
related regions while retaining spatial coherence between the target and the
surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates
superior performance in zero-shot comprehension of referring expressions on the
RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an
average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the
RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
Related papers
- SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Consistency (RES) aims to provide a segmentation mask of the target object in an image referred to by the text.
We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z) - ScanFormer: Referring Expression Comprehension by Iteratively Scanning [11.95137121280909]
Referring Expression (REC) aims to localize the target objects specified by free-form natural language descriptions in images.
While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries.
This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model?
arXiv Detail & Related papers (2024-06-26T03:56:03Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - Contrastive Region Guidance: Improving Grounding in Vision-Language
Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - Vision-Enhanced Semantic Entity Recognition in Document Images via
Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel textbfVisually-textbfAsymmetric cotextbfNsistentextbfCy textbfLearning (textscVancl) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z) - MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner
for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability.
arXiv Detail & Related papers (2023-08-09T09:35:16Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained
Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.