Related papers: Fine-Grained Visual Prompting

Fine-Grained Visual Prompting

URL: http://arxiv.org/abs/2306.04356v2
Date: Tue, 12 Dec 2023 06:36:53 GMT
Title: Fine-Grained Visual Prompting
Authors: Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang
Abstract summary: Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset.
Score: 35.032567257651515
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated impressive zero-shot transfer capabilities in image-level visual perception. However, these models have shown limited performance in instance-level tasks that demand precise localization and recognition. Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest. Nonetheless, compared to language prompting, visual prompting designs are rarely explored. Existing approaches, which employ coarse visual cues such as colorful boxes or circles, often result in sub-optimal performance due to the inclusion of irrelevant and noisy pixels. In this paper, we carefully study the visual prompting designs by exploring more fine-grained markings, such as segmentation masks and their variations. In addition, we introduce a new zero-shot framework that leverages pixel-level annotations acquired from a generalist segmentation model for fine-grained visual prompting. Consequently, our investigation reveals that a straightforward application of blur outside the target mask, referred to as the Blur Reverse Mask, exhibits exceptional effectiveness. This proposed prompting strategy leverages the precise mask annotations to reduce focus on weakly related regions while retaining spatial coherence between the target and the surrounding background. Our Fine-Grained Visual Prompting (FGVP) demonstrates superior performance in zero-shot comprehension of referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks. It outperforms prior methods by an average margin of 3.0% to 4.6%, with a maximum improvement of 12.5% on the RefCOCO+ testA subset. Code is available at https://github.com/ylingfeng/FGVP.

Related papers

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection [54.21851618853518]
We present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks.
arXiv Detail & Related papers (2025-03-21T12:10:38Z)
Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking [9.378011289206428]
We propose a new task, Pixel-Level Visual Entity Linking (PL-VEL) PL-VEL uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels.
arXiv Detail & Related papers (2024-12-18T08:49:01Z)
Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing to train deep interpretable models. Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z)
Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation [42.020470627552136]
Open-vocabulary panoptic segmentation is an emerging task aiming to accurately segment the image into semantically meaningful masks. mask classification is the main performance bottleneck for open-vocab panoptic segmentation. We propose Semantic Refocused Tuning, a novel framework that greatly enhances open-vocab panoptic segmentation.
arXiv Detail & Related papers (2024-09-24T17:50:28Z)
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation [47.0028071183214]
FrozenSeg is designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP) FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner.
arXiv Detail & Related papers (2024-09-05T13:36:50Z)
SafaRi:Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Consistency (RES) aims to provide a segmentation mask of the target object in an image referred to by the text. We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z)
ScanFormer: Referring Expression Comprehension by Iteratively Scanning [11.95137121280909]
Referring Expression (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model?
arXiv Detail & Related papers (2024-06-26T03:56:03Z)
RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories. This paper introduces a Retrieving And Ranking augmented method for MLLMs. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information. We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Our approach significantly boosts the capacity of segmentation models for unseen classes.
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework. TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial for mask post-processing. The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition working well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target for zero-shot semantic segmentation, by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses previous state-of-the-arts by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.