Fine-Grained Visual Prompting
- URL: http://arxiv.org/abs/2306.04356v2
- Date: Tue, 12 Dec 2023 06:36:53 GMT
- Title: Fine-Grained Visual Prompting
- Authors: Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
- Abstract summary: Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions.
It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
- Score: 35.032567257651515
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive
zero-shot transfer capabilities in image-level visual perception. However,
these models have shown limited performance in instance-level tasks that demand
precise localization and recognition. Previous works have suggested that
incorporating visual prompts, such as colorful boxes or circles, can improve
the ability of models to recognize objects of interest. Nonetheless, compared
to language prompting, visual prompting designs are rarely explored. Existing
approaches, which employ coarse visual cues such as colorful boxes or circles,
often result in sub-optimal performance due to the inclusion of irrelevant and
noisy pixels. In this paper, we carefully study the visual prompting designs by
exploring more fine-grained markings, such as segmentation masks and their
variations. In addition, we introduce a new zero-shot framework that leverages
pixel-level annotations acquired from a generalist segmentation model for
fine-grained visual prompting. Consequently, our investigation reveals that a
straightforward application of blur outside the target mask, referred to as the
Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting
strategy leverages the precise mask annotations to reduce focus on weakly
related regions while retaining spatial coherence between the target and the
surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates
superior performance in zero-shot comprehension of referring expressions on the
RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an
average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the
RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.
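The Blur Reverse Mask described above lends itself to a short illustration. The following is a minimal sketch, not the authors' released implementation: it assumes Pillow, NumPy, PyTorch, and the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32, and it takes candidate masks as given (in the paper they come from a generalist segmentation model such as SAM). Function names and defaults are illustrative.

```python
# Sketch of the Blur Reverse Mask prompt: keep the target region sharp,
# Gaussian-blur everything outside its mask, then let CLIP score each
# prompted image against the referring expression (assumed pipeline).
import numpy as np
import torch
from PIL import Image, ImageFilter
from transformers import CLIPModel, CLIPProcessor


def blur_reverse_mask(image: Image.Image, mask: np.ndarray, radius: float = 10.0) -> Image.Image:
    """Blur pixels outside `mask` (same HxW as `image`); keep the masked region sharp."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    mask_img = Image.fromarray(mask.astype(np.uint8) * 255).convert("L")
    # composite(a, b, m): where m == 255 take `a` (sharp), where m == 0 take `b` (blurred)
    return Image.composite(image, blurred, mask_img)


@torch.no_grad()
def pick_referred_mask(image: Image.Image, masks: list[np.ndarray], expression: str) -> int:
    """Return the index of the candidate mask whose prompted image best matches the text."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompted = [blur_reverse_mask(image, m) for m in masks]
    inputs = processor(text=[expression], images=prompted, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (num_masks, 1)
    return int(logits.squeeze(-1).argmax().item())
```

The blur radius and CLIP checkpoint here are assumptions for illustration. The point of the sketch is that the target stays sharp while the surrounding context is only softened rather than removed, which matches the abstract's description of suppressing weakly related regions while retaining spatial coherence with the background.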
Related papers
- Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate action unit (AU) cues into classifier training, enabling the training of deep interpretable models.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z) - Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks.
Mask classification is the main performance bottleneck for open-vocabulary panoptic segmentation.
We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation [47.0028071183214]
FrozenSeg is designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP).
FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner.
arXiv Detail & Related papers (2024-09-05T13:36:50Z) - SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text.
We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z) - ScanFormer: Referring Expression Comprehension by Iteratively Scanning [11.95137121280909]
Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images.
While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries.
This inspires us to explore a question: can we eliminate linguistically irrelevant, redundant visual regions to improve the efficiency of the model?
arXiv Detail & Related papers (2024-06-26T03:56:03Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward them.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial-aware post-processing step for the masks.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.