ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- URL: http://arxiv.org/abs/2312.00784v2
- Date: Sat, 27 Apr 2024 01:53:39 GMT
- Title: ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
- Authors: Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee
- Abstract summary: We introduce a novel multimodal model capable of decoding arbitrary visual prompts.
This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow".
Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
- Score: 38.59120110371588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
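
The overlay idea in the abstract (drawing the marker directly into the RGB pixels rather than encoding the region separately) can be illustrated with a minimal sketch. This is not the authors' code: the function names `blend` and `overlay_box`, the nested-list image representation, and the `thickness` and `alpha` parameters are all assumptions chosen to keep the example dependency-free.

```python
# Illustrative sketch only (hypothetical helpers, not the ViP-LLaVA implementation):
# draw a box-shaped visual prompt directly onto an RGB image.

def blend(pixel, color, alpha):
    """Alpha-blend `color` over `pixel`; both are (r, g, b) tuples of 0-255 ints."""
    return tuple(round(alpha * c + (1 - alpha) * p) for p, c in zip(pixel, color))


def overlay_box(image, box, color=(255, 0, 0), thickness=1, alpha=1.0):
    """Return a copy of `image` with a rectangular outline drawn on it.

    image: list of rows, each row a list of (r, g, b) tuples.
    box:   (x0, y0, x1, y1) in inclusive pixel coordinates.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    out = [list(row) for row in image]  # copy so the input stays untouched
    # Visit only pixels inside the box, clipped to the image bounds.
    for y in range(max(y0, 0), min(y1, height - 1) + 1):
        for x in range(max(x0, 0), min(x1, width - 1) + 1):
            # A pixel belongs to the outline if it lies within `thickness`
            # pixels of any of the four box edges.
            on_border = (
                x - x0 < thickness or x1 - x < thickness
                or y - y0 < thickness or y1 - y < thickness
            )
            if on_border:
                out[y][x] = blend(out[y][x], color, alpha)
    return out


if __name__ == "__main__":
    black = [[(0, 0, 0)] * 6 for _ in range(6)]
    marked = overlay_box(black, (1, 1, 4, 4))  # "red bounding box" prompt
    for row in marked:
        print("".join("R" if px == (255, 0, 0) else "." for px in row))
```

The marked image would then be passed to the model like any other image, with the text prompt referring to the marker (for example, "the object within the red bounding box"). Other marker shapes such as arrows or scribbles would follow the same pattern of writing into the pixel grid.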
 
      
Related papers
        - ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
 Visual embedding models excel at zero-shot tasks like visual retrieval and classification.
Existing CLIP-based approaches embed images and text independently, and fuse the result.
We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
 arXiv  Detail & Related papers  (2025-03-01T03:29:02Z)
- UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models [23.044366104080822]
 We introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input.
UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis.
 Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks.
 arXiv  Detail & Related papers  (2024-12-30T06:34:18Z)
- More Pictures Say More: Visual Intersection Network for Open Set Object Detection [4.206612461069489]
 We introduce a strong DETR-based model, Visual Intersection Network for Open Set Object Detection (VINO).
VINO constructs a multi-image visual bank to preserve the semantic intersections of each category across all time steps.
Our approach guarantees a more precise alignment between target category semantics and region semantics, while significantly reducing pre-training time and resource demands.
 arXiv  Detail & Related papers  (2024-08-26T05:52:35Z)
- EarthMarker: Visual Prompt Learning for Region-level and Point-level Remote Sensing Imagery Comprehension [12.9701635989222]
 EarthMarker, the first visual prompting model for remote sensing (RS), is proposed; it excels in image-level, region-level, and point-level RS imagery interpretation.
To endow EarthMarker with versatile multi-granularity visual perception abilities, a cross-domain phased learning strategy is developed.
To tackle the lack of RS visual prompting data, a dataset named RSVP featuring multi-modal fine-grained visual prompting instruction is constructed.
 arXiv  Detail & Related papers  (2024-07-18T15:35:00Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
 We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
 arXiv  Detail & Related papers  (2024-03-29T16:26:20Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
 The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
 arXiv  Detail & Related papers  (2024-03-19T17:59:52Z)
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [27.45225442048711]
 We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts.
We design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models.
Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting.
 arXiv  Detail & Related papers  (2024-03-14T12:21:37Z)
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding [55.65727739645824]
 Chat-UniVi is a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos.
We employ a set of dynamic visual tokens to uniformly represent images and videos.
We leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details.
 arXiv  Detail & Related papers  (2023-11-14T10:11:36Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
 We propose GeoVLN, which learns a geometry-enhanced visual representation based on slot attention for robust Vision-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
 arXiv  Detail & Related papers  (2023-05-26T17:15:22Z)
- What does CLIP know about a red circle? Visual prompt engineering for VLMs [116.8806079598019]
 We explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text.
We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks.
 arXiv  Detail & Related papers  (2023-04-13T17:58:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     