Related papers: VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

URL: http://arxiv.org/abs/2512.11099v1
Date: Thu, 11 Dec 2025 20:21:00 GMT
Title: VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
Authors: Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu,
Abstract summary: VGent is a modular encoder-decoder architecture that disentangles high-level reasoning and low-level bounding box prediction.<n>We show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods.
Score: 23.814125316335154
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

Related papers

From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection.<n>We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model [1.3663057923522652]
We introduce Fusion to Enhance (FtZ), a novel vision tower framework.<n>FtZ moves beyond the single-encoder design by innovatively composing a semantically powerful anchor encoder with a perception-rich augmenting encoder.<n>This work proves that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs.
arXiv Detail & Related papers (2025-08-31T02:22:57Z)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework.<n>It integrates efficient frame selection with real-time reward generation during inference.<n>Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z)
Training-Free Reasoning and Reflection in MLLMs [45.134271969594614]
This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities.<n>Our key insight is to decouple perception and reasoning across MLLM decoder layers.<n>To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers.
arXiv Detail & Related papers (2025-05-22T02:51:12Z)
ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.<n>ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.<n>ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder.<n>Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder.<n> Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models [72.07868838411474]
Existing encoder-free vision-language models (VLMs) are narrowing the performance gap with their encoder-based counterparts.<n>We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones.<n>We show that properly and hierarchically associating vision and language within a unified model reduces interference between modalities.
arXiv Detail & Related papers (2025-02-10T18:59:58Z)
LAMBO: Large AI Model Empowered Edge Intelligence [71.56135386994119]
Next-generation edge intelligence is anticipated to benefit various applications via offloading techniques. Traditional offloading architectures face several issues, including heterogeneous constraints, partial perception, uncertain generalization, and lack of tractability. We propose a Large AI Model-Based Offloading (LAMBO) framework with over one billion parameters for solving these problems.
arXiv Detail & Related papers (2023-08-29T07:25:42Z)
Learning to Learn Better for Video Object Segmentation [94.5753973590207]
We propose a novel framework that emphasizes Learning to Learn Better (LLB) target features for SVOS. We design the discriminative label generation module (DLGM) and the adaptive fusion module to address these issues. Our proposed LLB method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-12-05T09:10:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.