Related papers: Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

URL: http://arxiv.org/abs/2505.13788v1
Date: Tue, 20 May 2025 00:37:19 GMT
Title: Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Authors: Yongshuo Zong, Qin Zhang, Dongsheng An, Zhihua Li, Xiang Xu, Linghan Xu, Zhuowen Tu, Yifan Xing, Onkar Dabeer,
Abstract summary: We address five critical real-world challenges in text-instruction-based grounding.<n>Our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations.<n>Experiment results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks.
Score: 30.722073025794025
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experiment results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly achieves an average accuracy boost of 4.4% for LISA and a 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.

Related papers

Grounding Computer Use Agents on Human Demonstrations [40.66362945241247]
GroundCUA is a large-scale desktop grounding dataset built from expert human demonstrations.<n>It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations.<n>Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements.
arXiv Detail & Related papers (2025-11-10T17:35:21Z)
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding [39.967352995143855]
GroundingSuite aims to bridge the gap between vision and language modalities.<n>It comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images.
arXiv Detail & Related papers (2025-03-13T17:43:10Z)
SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models [15.50826328938879]
We introduce SURDS, a benchmark designed to evaluate the spatial reasoning capabilities of vision language models (VLMs)<n>Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples.<n>We propose a reinforcement learning-based alignment scheme leveraging spatially grounded reward signals.
arXiv Detail & Related papers (2024-11-20T08:14:01Z)
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding [4.266920365127677]
Under the new LaGD paradigm, the old datasets are no longer suitable for fire-new tasks. We designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information.
arXiv Detail & Related papers (2024-06-18T10:34:28Z)
Less is More: High-value Data Selection for Visual Instruction Tuning [127.38740043393527]
We propose a high-value data selection approach TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. Our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks.
arXiv Detail & Related papers (2024-03-14T16:47:25Z)
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection [24.48128633414131]
We propose a zero-shot method that harnesses visual grounding ability from existing models trained from image-text pairs and pure object detection data. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets.
arXiv Detail & Related papers (2023-12-22T20:14:55Z)
Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs) We first build a vision-language feedback dataset utilizing AI annotation. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. The resulting model Silkie, achieves 6.9% and 9.5% relative improvement on the MME benchmark regarding the perception and cognition capabilities.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models [30.723122000372538]
AnomalyGPT is a novel IAD approach based on Large Vision-Language Models (LVLM) We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. AnomalyGPT achieves the state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset.
arXiv Detail & Related papers (2023-08-29T15:02:53Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency produces superior visual grounding results than previous methods. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model.
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning [60.75687261314962]
We introduce pixel-level pretext tasks for learning dense feature representations. A pixel-to-propagation consistency task produces better results than state-of-the-art approaches. Results demonstrate the strong potential of defining pretext tasks at the pixel level.
arXiv Detail & Related papers (2020-11-19T18:59:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.