Visual In-Context Prompting
- URL: http://arxiv.org/abs/2311.13601v1
- Date: Wed, 22 Nov 2023 18:59:48 GMT
- Title: Visual In-Context Prompting
- Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou,
Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
- Abstract summary: In this paper, we introduce a universal visual in-context prompting framework for both vision tasks like open-set segmentation and detection.
We build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points.
Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities.
- Score: 100.93587329049848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context prompting in large language models (LLMs) has become a prevalent
approach to improve zero-shot capabilities, but this idea is less explored in
the vision domain. Existing visual prompting methods focus on referring
segmentation to segment the most relevant object, falling short of addressing
many generic vision tasks like open-set segmentation and detection. In this
paper, we introduce a universal visual in-context prompting framework for both
tasks. In particular, we build on top of an encoder-decoder architecture, and
develop a versatile prompt encoder to support a variety of prompts like
strokes, boxes, and points. We further enhance it to take an arbitrary number
of reference image segments as the context. Our extensive explorations show
that the proposed visual in-context prompting elicits extraordinary referring
and generic segmentation capabilities to refer and detect, yielding competitive
performance to close-set in-domain datasets and showing promising results on
many open-set segmentation datasets. By joint training on COCO and SA-1B, our
model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be
available at https://github.com/UX-Decoder/DINOv.
Related papers
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO)
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
arXiv Detail & Related papers (2024-08-26T05:52:35Z) - Visual Prompt Selection for In-Context Learning Segmentation [77.15684360470152]
In this paper, we focus on rethinking and improving the example selection strategy.
We first demonstrate that ICL-based segmentation models are sensitive to different contexts.
Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation.
arXiv Detail & Related papers (2024-07-14T15:02:54Z) - ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts [38.59120110371588]
We introduce a novel multimodal model capable of decoding arbitrary visual prompts.
This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow"
Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
arXiv Detail & Related papers (2023-12-01T18:59:56Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS)
We construct a large-scale complex scene dataset (textbfOVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z) - Open-world Semantic Segmentation via Contrasting and Clustering
Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.