Multimodal Reference Visual Grounding
- URL: http://arxiv.org/abs/2504.02876v1
- Date: Wed, 02 Apr 2025 00:19:05 GMT
- Title: Multimodal Reference Visual Grounding
- Authors: Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang
- Abstract summary: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance. We introduce a new task named Multimodal Reference Visual Grounding (MRVG). We show that our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs.
- Score: 24.047088603900644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding. Project page with our code and dataset: https://irvlutd.github.io/MultiGrounding
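The abstract describes a two-stage recipe: reference images are first used by a few-shot detector to label candidate boxes in the query image, and an LLM then matches the language expression to one of those labeled candidates. Below is a minimal, self-contained sketch of that flow; the toy feature vectors, the cosine-similarity labeling, and the keyword-based stand-in for the LLM matcher are illustrative assumptions, not the released MRVG-Net implementation (see the project page above for the actual code).

```python
# Minimal sketch of the two-stage idea described in the abstract; the feature
# vectors, the cosine-similarity labeling, and the keyword-based matcher are
# illustrative stand-ins, not the released MRVG-Net implementation.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Reference database: object name -> feature vector (in a real system, produced
# by a vision encoder run over the reference images).
references = {
    "diet coke": [0.90, 0.10, 0.20],
    "regular coke": [0.80, 0.30, 0.10],
}

# Candidate detections in the query image: (bounding box, feature vector),
# as a few-shot detector might return them.
candidates = [
    ((10, 20, 60, 120), [0.88, 0.12, 0.19]),
    ((70, 20, 120, 120), [0.79, 0.31, 0.12]),
]

def ground(expression, candidates, references):
    # Stage 1: label each candidate box with its nearest reference object.
    labeled = [
        (box, max(references, key=lambda name: cosine(feat, references[name])))
        for box, feat in candidates
    ]
    # Stage 2: stand-in for the LLM matcher -- return the candidate whose
    # reference label appears in the language expression.
    for box, name in labeled:
        if name in expression.lower():
            return box, name
    return None

print(ground("pick up the Diet Coke can", candidates, references))
```

Replacing the keyword check with an actual LLM call that sees the labeled candidates and the expression would recover the structure the abstract describes.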
Related papers
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification.
Existing CLIP-based approaches embed images and text independently, and fuse the result.
We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding [42.10086029931937]
Visual grounding aims to localize the object referred to in an image based on a natural language query.
Existing methods demonstrate a significant performance drop when there are multiple distractions in an image.
We propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue.
arXiv Detail & Related papers (2024-08-29T07:32:01Z)
- Targeted Visual Prompting for Medical Visual Question Answering [3.600327818936722]
Multimodal large language models (MLLMs) have emerged as an alternative to classical model architectures.
Simple visual errors cast doubt on the actual visual understanding abilities of these models.
This paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities.
arXiv Detail & Related papers (2024-08-06T08:58:20Z)
- VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin).
VisMin requires models to predict the correct image-caption match given two images and two captions.
We build an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators.
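The pairing task is easy to state concretely: given two images and two captions that differ by a minimal change, a model must assign each image its correct caption. The snippet below is a small illustrative sketch of that evaluation, assuming a generic image-text similarity function; it is not code from the VisMin release.

```python
# Hypothetical sketch of the two-image / two-caption pairing evaluation;
# `score(image, caption)` stands in for any image-text similarity model
# (e.g. a CLIP-style encoder) and is not part of the VisMin release.
def predict_pairing(images, captions, score):
    """For each image, pick the index of the caption with the highest score."""
    return [
        max(range(len(captions)), key=lambda j: score(image, captions[j]))
        for image in images
    ]

def pairing_correct(prediction):
    # By convention here, image i is paired with caption i, so the correct
    # prediction for the two-item case is [0, 1].
    return prediction == list(range(len(prediction)))
```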
arXiv Detail & Related papers (2024-07-23T18:10:43Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [160.6296629396925]
"List items one by one" asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags.
We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs.
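As a rough illustration of the data paradigm summarized above, the sketch below assembles a training example from numbered tags overlaid on an image; the prompt wording and annotation format are assumptions, not the paper's released data pipeline.

```python
# Illustrative sketch (not the paper's code) of how a "list items one by one"
# training example could be assembled from numbered tags overlaid on an image:
# the target enumerates the tags in alphanumeric order with short descriptions.
def build_example(tag_descriptions):
    """tag_descriptions: dict mapping tag number -> short object description."""
    prompt = ("The image contains numbered tags. List the items one by one in "
              "tag order and describe the object under each tag.")
    target = "\n".join(f"{tag}. {desc}"
                       for tag, desc in sorted(tag_descriptions.items()))
    return prompt, target

# Example with hypothetical annotations:
print(build_example({2: "a red mug", 1: "a laptop", 3: "a notebook"})[1])
```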
arXiv Detail & Related papers (2024-04-25T07:29:17Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Visual Named Entity Linking: A New Dataset and A Baseline [61.38231023490981]
We consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image.
We propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
We present a high-quality human-annotated visual person linking dataset, named WIKIPerson.
arXiv Detail & Related papers (2022-11-09T13:27:50Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Learning to Ground Visual Objects for Visual Dialog [26.21407651331964]
We propose a novel approach to Learn to Ground visual objects for visual dialog.
A posterior distribution over visual objects is inferred from both context (history and questions) and answers.
A prior distribution, which is inferred from context only, is used to approximate the posterior distribution so that appropriate visual objects can be grounded even without answers.
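The summary above describes a variational-style setup: a posterior over candidate visual objects conditions on both the dialog context and the answer, while a prior conditions on the context alone and is trained to approximate the posterior. The snippet below sketches the usual KL term for such a setup in PyTorch; the exact networks and objective in the paper may differ.

```python
# Sketch of a standard KL term for matching a context-only prior to a
# context-plus-answer posterior over candidate visual objects (PyTorch);
# this is an illustrative assumption, not the paper's released code.
import torch
import torch.nn.functional as F

def kl_posterior_to_prior(posterior_logits: torch.Tensor,
                          prior_logits: torch.Tensor) -> torch.Tensor:
    """KL(posterior || prior) over a set of candidate visual objects.

    posterior_logits: scores computed from dialog context and the answer.
    prior_logits:     scores computed from the dialog context only.
    """
    post_log = F.log_softmax(posterior_logits, dim=-1)
    prior_log = F.log_softmax(prior_logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) when both tensors
    # hold log-probabilities and log_target=True.
    return F.kl_div(prior_log, post_log, log_target=True, reduction="batchmean")
```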
arXiv Detail & Related papers (2021-09-13T14:48:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.