GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2601.04777v1
- Date: Thu, 08 Jan 2026 09:58:35 GMT
- Title: GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
- Authors: Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang
- Abstract summary: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. We propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding.
- Score: 30.759062684007873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained to single-target localization and a limited range of practical tasks, owing to the lack of unified modeling for generalized grounding tasks. We therefore propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance on cross-image cues and reasoning, and introduce the MG-Data-240K dataset, which addresses the limitations of existing datasets regarding target quantity and image relations. To robustly handle diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, exploiting their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
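The abstract specifies an R1-like algorithm guided by a rule-based reward but does not spell the reward out. As a rough illustration only, the Python sketch below shows one plausible shape for such a reward in multi-image grounding: a small format term (was a CoT emitted?) plus an IoU-based accuracy term over per-image predicted boxes. The tag format, the greedy matching scheme, and the 0.1/0.9 weighting are all assumptions, not GeM-VG's actual design.

```python
# Hypothetical rule-based reward for multi-image grounding in an R1-like
# loop: a small format term plus an IoU-based accuracy term. The tag format,
# matching scheme, and 0.1/0.9 weights are assumptions, not GeM-VG's design.
import re

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(response: str, gt_boxes: dict) -> float:
    """gt_boxes maps image index -> list of ground-truth boxes. Predictions
    are parsed from assumed '<img k>[x1, y1, x2, y2]' tags in the response."""
    has_cot = bool(re.search(r"<think>.*</think>", response, re.S))
    preds = {}
    for m in re.finditer(r"<img (\d+)>\[([\d.,\s]+)\]", response):
        box = tuple(float(v) for v in m.group(2).split(","))
        preds.setdefault(int(m.group(1)), []).append(box)
    # Greedy matching: each ground-truth box claims its best unused prediction.
    scores = []
    for idx, boxes in gt_boxes.items():
        unused = list(preds.get(idx, []))
        for g in boxes:
            best = max(unused, key=lambda p: iou(p, g), default=None)
            scores.append(iou(best, g) if best else 0.0)
            if best is not None:
                unused.remove(best)
    accuracy = sum(scores) / len(scores) if scores else 0.0
    return 0.1 * has_cot + 0.9 * accuracy
```

In an R1-style algorithm such as GRPO, a scalar like this would score each rollout in a sampled group and the group-normalized advantages would drive the policy update; a direct-answer rollout could simply skip the `<think>` term.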
Related papers
- More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z)
- OneThinker: All-in-one Reasoning Model for Image and Video [45.8205286430071]
We propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse visual tasks. Experiments show that OneThinker delivers strong performance on 31 benchmarks across 10 fundamental visual understanding tasks.
arXiv Detail & Related papers (2025-12-02T18:59:52Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning [10.049259114211663]
Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. We propose a unified framework, Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning on annotated trajectories with image-aware reinforcement learning optimization.
arXiv Detail & Related papers (2025-09-26T02:43:22Z)
- Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning [28.111812077758845]
Multimodal Large Language Models (MLLMs) excel at visual grounding in single-image scenarios with textual references. However, their performance degrades when handling real-world applications that involve complex multi-image compositions and multimodal instructions. We adopt a reinforcement-learning-based post-training strategy to improve the reasoning of MLLMs in multi-image grounding tasks.
arXiv Detail & Related papers (2025-07-01T13:48:57Z)
- PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutation of image sequences to simulate varied positional relationships and to explore greater spatial and positional diversity (see the sketch after this entry). Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
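PeRL's core data-side move, permuting the image sequence to diversify positional relationships, is easy to illustrate. The sketch below is hypothetical: it assumes samples carry an images list and `<image k>` tags in the text, shuffles the images, and remaps the tags so references stay consistent. PeRL's actual pipeline may differ.

```python
# Illustrative image-sequence permutation for interleaved vision-language
# samples, in the spirit of PeRL. The sample schema and <image k> tag format
# are assumptions for this sketch.
import random
import re

def permute_images(sample: dict, rng: random.Random) -> dict:
    """sample = {"images": [...], "text": "... <image 0> ... <image 2> ..."}
    Returns a copy with the images shuffled and every <image k> tag remapped
    so textual references still point at the same underlying image."""
    order = list(range(len(sample["images"])))
    rng.shuffle(order)                                 # order[new] = old index
    new_pos = {old: new for new, old in enumerate(order)}
    images = [sample["images"][old] for old in order]
    text = re.sub(
        r"<image (\d+)>",
        lambda m: f"<image {new_pos[int(m.group(1))]}>",
        sample["text"],
    )
    return {"images": images, "text": text}

# Usage: generate positional variants of one training sample.
rng = random.Random(0)
sample = {"images": ["imgA", "imgB", "imgC"], "text": "Find the dog in <image 1>."}
print(permute_images(sample, rng))
```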
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning [30.073631823776825]
We propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding. We first construct a high-quality chain-of-thought grounding dataset annotated with detailed reasoning chains. We then perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities.
arXiv Detail & Related papers (2025-05-20T11:40:43Z)
- Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models [79.59567114769513]
We introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. Our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models.
arXiv Detail & Related papers (2025-01-10T07:56:23Z)
- HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding [80.85164509232261]
HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm.
HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner (see the sketch after this entry).
arXiv Detail & Related papers (2024-04-20T14:57:31Z)
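One way to read HiLoRA's shallow-to-deep adaptation is as ordinary low-rank adapters whose training is staged over layer groups, with shallow adapters unfrozen first. The PyTorch sketch below follows that reading; the rank, staging, and grouping are assumptions, and HiVG's real design (including its cross-modal bridge) is not reproduced here.

```python
# Minimal LoRA layer plus a shallow-to-deep training schedule, as a sketch of
# hierarchical low-rank adaptation in the HiLoRA spirit. All hyperparameters
# and the staging rule are assumptions for illustration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = Wx + scale * B(Ax): the low-rank path is the only trainable part
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

def hierarchical_stages(adapters, n_stages: int = 3):
    """Shallow-to-deep schedule: stage i unfreezes a cumulative prefix of the
    adapter list, leaving deeper adapters frozen until later stages."""
    step = max(1, len(adapters) // n_stages)
    for i in range(1, n_stages + 1):
        active = adapters if i == n_stages else adapters[: i * step]
        for a in adapters:
            trainable = any(a is x for x in active)
            a.A.requires_grad_(trainable)
            a.B.requires_grad_(trainable)
        yield active  # caller runs one training phase on these adapters
```

A caller would iterate `for active in hierarchical_stages(adapters): train(active)`, so errors introduced by shallow layers are corrected before deeper adapters begin training.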
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Building on this joint mixing, we further present an efficient strategy to better capture the fine-grained appearance of high-resolution images.
We hope our work casts light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.