Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm
- URL: http://arxiv.org/abs/2412.10719v1
- Date: Sat, 14 Dec 2024 07:23:14 GMT
- Title: Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm
- Authors: Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang, Erli Meng, Zhengnan Hu
- Abstract summary: Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers.
Mainstream OSOD and OSS methods generally utilize text as a prompt, achieving remarkable performance.
We propose a novel prompt paradigm for OSOD and OSS: the Image Prompt Paradigm.
In this framework, high-quality image prompts are automatically encoded, selected, and fused, achieving single-stage, non-interactive inference.
- Score: 22.407887601771026
- Abstract: To break through the limitation of pre-trained models to fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally use text as a prompt, achieving remarkable performance. Following the SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover detection or segmentation targets. Although these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe the characteristics of a specialized category using a textual description. On the other hand, existing visual prompt paradigms rely heavily on multi-round human interaction, which hinders their use in fully automated pipelines. To address these issues, we propose a novel prompt paradigm for OSOD and OSS: the \textbf{Image Prompt Paradigm}. This new paradigm enables the detection or segmentation of specialized categories without multi-round human intervention. To achieve this goal, the image prompt paradigm uses just a few image instances as prompts, and we propose a novel framework named \textbf{MI Grounding} for this new paradigm. In this framework, high-quality image prompts are automatically encoded, selected, and fused, achieving single-stage, non-interactive inference. Extensive experiments on public datasets show that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared to text prompt paradigm methods and visual prompt paradigm methods. Moreover, MI Grounding greatly outperforms existing methods on our constructed specialized ADR50K dataset.
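The abstract's encode-select-fuse pipeline can be sketched as follows. This is a minimal illustration only: the encoder, the mean-pooling fusion, and all function names are hypothetical stand-ins, not the paper's actual MI Grounding implementation.

```python
import numpy as np

def encode(images):
    # Stand-in for a real image encoder (e.g., a ViT backbone); here we
    # simply average pixel values into a fixed-length embedding.
    return np.stack([img.mean(axis=(0, 1)) for img in images])

def fuse_prompts(prompt_embeddings):
    # Fuse a few image-prompt embeddings into one category embedding.
    # Mean pooling is a simple placeholder for learned selection/fusion.
    fused = prompt_embeddings.mean(axis=0)
    return fused / np.linalg.norm(fused)

def score_regions(region_embeddings, category_embedding):
    # Single-stage, non-interactive inference: cosine similarity between
    # each candidate region and the fused category embedding.
    norms = np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    return (region_embeddings / norms) @ category_embedding

rng = np.random.default_rng(0)
prompts = [rng.random((8, 8, 3)) for _ in range(3)]  # a few prompt images
category = fuse_prompts(encode(prompts))
regions = rng.random((5, 3))  # candidate region features from a detector
scores = score_regions(regions, category)  # one score per candidate region
```

Because the prompt images are fused offline into a single category embedding, no human interaction is needed at inference time, which is the key property the paradigm claims over point/box/mask prompting.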
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection [37.57355457749918]
We introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP.
Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction.
Experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings.
arXiv Detail & Related papers (2024-08-05T14:05:25Z)
- Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z)
- Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model [61.389233691596004]
We introduce the DiffPNG framework, which capitalizes on the diffusion's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
arXiv Detail & Related papers (2024-07-07T13:06:34Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Relax Image-Specific Prompt Requirement in SAM: A Single Generic Prompt for Segmenting Camouflaged Objects [32.14438610147615]
We introduce a test-time adaptation per-instance mechanism called Generalizable SAM (GenSAM) to automatically generate and optimize visual prompts.
Experiments on three benchmarks demonstrate that GenSAM outperforms point supervision approaches.
arXiv Detail & Related papers (2023-12-12T15:43:36Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- Towards Open-Ended Visual Recognition with Large Language Model [27.56182473356992]
We introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier.
OSM predicts class labels in a generative manner, removing the need to supply class names during both training and testing.
It also enables cross-dataset training without any human interference.
arXiv Detail & Related papers (2023-11-14T18:59:01Z)
- Explicit Visual Prompting for Universal Foreground Segmentations [55.51869354956533]
We present a unified framework for a number of foreground segmentation tasks without any task-specific designs.
We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP.
Our method freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters.
arXiv Detail & Related papers (2023-05-29T11:05:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.