Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- URL: http://arxiv.org/abs/2310.11441v2
- Date: Mon, 6 Nov 2023 07:39:49 GMT
- Title: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
- Authors: Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng
Gao
- Abstract summary: We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models.
We employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions, and overlay these regions with a set of marks.
Using the marked image as input, GPT-4V can answer the questions that require visual grounding.
- Score: 103.68138147783614
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Set-of-Mark (SoM), a new visual prompting method, to unleash the
visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
As illustrated in Fig. 1 (right), we employ off-the-shelf interactive
segmentation models, such as SEEM/SAM, to partition an image into regions at
different levels of granularity, and overlay these regions with a set of marks,
e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can
answer the questions that require visual grounding. We perform a comprehensive
empirical study to validate the effectiveness of SoM on a wide range of
fine-grained vision and multimodal tasks. For example, our experiments show
that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art
fully-finetuned referring expression comprehension and segmentation model on
RefCOCOg. Code for SoM prompting is made public at:
https://github.com/microsoft/SoM.
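
The abstract already spells out the full SoM recipe: segment, overlay marks, then prompt. The snippet below is a minimal sketch of that pipeline, assuming SAM's automatic mask generator and the OpenAI chat API; the checkpoint path, file names, mark styling, question, and model name are illustrative assumptions, not the settings used in the official repository linked above.

```python
# Minimal SoM-style pipeline sketch (not the authors' exact implementation):
# 1) partition the image into regions with an off-the-shelf segmenter (SAM),
# 2) overlay numeric marks on the regions,
# 3) ask GPT-4V a question that can be answered by referring to the marks.
import base64

import cv2
import numpy as np
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def mark_image(image_bgr: np.ndarray, masks: list) -> np.ndarray:
    """Draw a numeric label at the centroid of each segmented region."""
    marked = image_bgr.copy()
    regions = sorted(masks, key=lambda m: m["area"], reverse=True)
    for idx, m in enumerate(regions, start=1):
        ys, xs = np.nonzero(m["segmentation"])
        cv2.putText(marked, str(idx), (int(xs.mean()), int(ys.mean())),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 255, 255), 2, cv2.LINE_AA)
    return marked


# 1) Segment: SAM's automatic mask generator proposes regions without prompts.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local checkpoint
bgr = cv2.imread("input.jpg")  # assumed input image
masks = SamAutomaticMaskGenerator(sam).generate(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))

# 2) Mark: overlay numeric labels and encode the marked image as base64.
marked = mark_image(bgr, masks)
png_bytes = cv2.imencode(".png", marked)[1].tobytes()
b64 = base64.b64encode(png_bytes).decode()

# 3) Prompt: send the marked image to a GPT-4 vision model with a grounding question.
client = OpenAI()  # expects OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable GPT-4 model
    messages=[{"role": "user", "content": [
        {"type": "text",
         "text": "The image regions are labeled with numbers. "
                 "Which numbered region contains the dog, and what is it doing?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
)
print(response.choices[0].message.content)
```

Because the marks give the model an explicit index over regions, its answer can refer to them by number, which is the grounding behavior the paper evaluates.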
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- OmniParser for Pure Vision Based GUI Agent [37.911094082816504]
The power of multimodal models like GPT-4V as general agents across multiple operating systems is largely underestimated due to the lack of a robust screen parsing technique.
OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark.
OmniParser with screenshot input alone outperforms GPT-4V baselines that require additional information beyond the screenshot.
arXiv Detail & Related papers (2024-08-01T00:00:43Z)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [160.6296629396925]
"List items one by one" asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags.
We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs.
arXiv Detail & Related papers (2024-04-25T07:29:17Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
The proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection [51.43589678946244]
This paper explores the potential of VQA-oriented GPT-4V in the popular visual Anomaly Detection (AD) task.
It is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets.
arXiv Detail & Related papers (2023-11-05T10:01:18Z)