Related papers: Visual Position Prompt for MLLM based Visual Grounding

URL: http://arxiv.org/abs/2503.15426v4
Date: Wed, 16 Jul 2025 13:53:05 GMT
Title: Visual Position Prompt for MLLM based Visual Grounding
Authors: Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
Abstract summary: We introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt to improve its grounding capability. We also introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. The resulting model achieves state-of-the-art results on standard visual grounding benchmarks.
Score: 29.34950670755899
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address these issues, we introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms: the global VPP overlays a learnable, axis-like tensor onto the input image to provide structured spatial cues, while the local VPP incorporates position-aware queries to support fine-grained localization. To effectively train our model with spatial guidance, we further introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. Designed in a compact format, it enables efficient training and is significantly smaller than datasets used by other MLLMs (e.g., ~21M samples in MiniGPT-v2), yet still provides a strong performance boost. The resulting model, VPP-LLaVA, not only achieves state-of-the-art results on standard visual grounding benchmarks but also demonstrates strong zero-shot generalization to challenging unseen datasets. The code and dataset are available at https://github.com/WayneTomas/VPP-LLaVA.

Related papers

FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs [4.478610052538001]
We propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling.
arXiv Detail & Related papers (2025-07-01T07:20:11Z)
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [22.729750410621826]
GeoPix is an RS MLLM that extends image understanding capabilities to the pixel level. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor. To address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset.
arXiv Detail & Related papers (2025-01-12T14:45:27Z)
Learning to Rank Pre-trained Vision-Language Models for Downstream Tasks [41.488394198111976]
Vision language models (VLMs) like CLIP show stellar zero-shot capability on classification benchmarks. Selecting the VLM with the highest performance on the unlabeled downstream task is non-trivial. This paper introduces the problem of unsupervised vision-language model selection, where only unsupervised downstream datasets are available.
arXiv Detail & Related papers (2024-12-30T03:26:53Z)
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model [71.50973774576431]
We propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective. Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features.
arXiv Detail & Related papers (2024-07-23T06:02:30Z)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems. These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions. We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
arXiv Detail & Related papers (2024-02-13T18:39:18Z)
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks. It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences. We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
Optimal Transport Aggregation for Visual Place Recognition [9.192660643226372]
We introduce SALAD, which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative. Our single-stage method not only surpasses single-stage baselines on public VPR datasets but also surpasses two-stage methods that add re-ranking at significantly higher cost.
arXiv Detail & Related papers (2023-11-27T15:46:19Z)
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs). This integration promotes a more detailed comprehension of images for the MLLM. We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models. We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research. In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks. We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation. We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training. PTP reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object. PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference while the latter cannot.
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model that serves both localization tasks and Vision-Language (VL) understanding tasks. GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks. We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.