Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- URL: http://arxiv.org/abs/2310.11441v2
- Date: Mon, 6 Nov 2023 07:39:49 GMT
- Title: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- Authors: Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
- Abstract summary: We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models.
We employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions, and overlay these regions with a set of marks.
Using the marked image as input, GPT-4V can answer the questions that require visual grounding.
- Score: 103.68138147783614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Set-of-Mark (SoM), a new visual prompting method, to unleash the
visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
As illustrated in Fig. 1 (right), we employ off-the-shelf interactive
segmentation models, such as SEEM/SAM, to partition an image into regions at
different levels of granularity, and overlay these regions with a set of marks,
e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can
answer the questions that require visual grounding. We perform a comprehensive
empirical study to validate the effectiveness of SoM on a wide range of
fine-grained vision and multimodal tasks. For example, our experiments show
that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art
fully-finetuned referring expression comprehension and segmentation model on
RefCOCOg. Code for SoM prompting is made public at:
https://github.com/microsoft/SoM.
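The marking step described in the abstract can be illustrated with a minimal sketch. It assumes the region masks have already been produced by a segmentation model such as SEEM or SAM and are available as boolean NumPy arrays. The function names `place_marks` and `build_prompt`, the largest-first numbering, the centroid-based anchor placement, and the prompt wording are all illustrative assumptions, not the authors' implementation; the released code at the URL above is the reference.

```python
import numpy as np


def place_marks(masks):
    """Assign a numeric mark and an anchor pixel to each region mask.

    masks: list of 2-D boolean arrays of the same shape, one per region
           (assumed to come from a segmentation model such as SEEM or SAM).
    Returns {mark_id: (row, col)}; the anchor always lies inside its region.
    """
    # Ignore empty masks so every mark refers to a visible region.
    regions = [m for m in masks if m.any()]
    # Assumption: number larger regions first (ties keep input order).
    regions.sort(key=lambda m: -int(m.sum()))

    anchors = {}
    for mark_id, mask in enumerate(regions, start=1):
        rows, cols = np.nonzero(mask)
        centroid = np.array([rows.mean(), cols.mean()])
        # The centroid can fall outside a concave region, so snap it to
        # the closest pixel that actually belongs to the mask.
        pixels = np.stack([rows, cols], axis=1)
        nearest = pixels[np.argmin(((pixels - centroid) ** 2).sum(axis=1))]
        anchors[mark_id] = (int(nearest[0]), int(nearest[1]))
    return anchors


def build_prompt(question, anchors):
    """Compose a text prompt that refers to the marks drawn on the image.

    The wording is hypothetical; it only shows how a question can be tied
    to the mark IDs that were overlaid on the image.
    """
    ids = ", ".join(str(i) for i in sorted(anchors))
    return (
        f"The image is annotated with numbered marks: {ids}. "
        f"{question} Refer to regions by their mark numbers."
    )
```

In a full pipeline, each mark ID would be drawn onto the image at its anchor pixel (optionally together with the mask outline or a box) before the marked image and the prompt are sent to the multimodal model.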
Related papers
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [160.6296629396925]
"List items one by one" asks the model to enumerate and describe all visual tags placed on the image, following the alphanumeric order of the tags.
We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs.
arXiv Detail & Related papers (2024-04-25T07:29:17Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU scores on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models [5.360103006279672]
This study focuses on the remote sensing domain, where the images are notably dissimilar from those in conventional scenarios.
We developed a pipeline that leverages multiple foundation models to facilitate remote sensing image semantic segmentation tasks guided by text prompt.
The pipeline is benchmarked on several widely-used remote sensing datasets, and we present preliminary results to demonstrate its effectiveness.
arXiv Detail & Related papers (2023-04-20T18:39:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.