VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
- URL: http://arxiv.org/abs/2512.06373v1
- Date: Sat, 06 Dec 2025 09:58:01 GMT
- Title: VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
- Authors: Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang,
- Abstract summary: VG-Refiner is first framework aiming at the tool-refined referring grounded reasoning.<n>We introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback.<n>We achieve a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks.
- Score: 35.49623258104176
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate detection tool predictions often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose the VG-Refiner, the first framework aiming at the tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We adopt a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.
Related papers
- Reasoning and Tool-use Compete in Agentic RL:From Quantifying Interference to Disentangled Tuning [26.401906729658688]
Agentic Reinforcement Learning (ARL) focuses on training large language models to interleave reasoning with external tool execution to solve complex tasks.<n>Most existing ARL methods train a single shared model parameters to support both reasoning and tool use behaviors, implicitly assuming that joint training leads to improved overall agent performance.<n>We show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization.<n>We propose Disentangled Action Reasoning Tuning(DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool-use via separate low-rank
arXiv Detail & Related papers (2026-02-01T03:19:22Z) - From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an Agentic multimodal Reward Model that autonomously invokes external tools to ground judgments in verifiable evidence.<n>We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy.<n>Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z) - Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models [33.214586668992965]
Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning.<n>We propose RECAP-a replay strategy with dynamic objective reweighting for general knowledge.<n>Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning.
arXiv Detail & Related papers (2025-10-24T19:08:48Z) - Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting [18.018179328110048]
We introduce a predictability-aligned diagnostic framework grounded in spectral coherence.<n>We provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time.<n>Our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks.
arXiv Detail & Related papers (2025-09-27T02:56:06Z) - Rethinking Evaluation of Infrared Small Target Detection [105.59753496831739]
This paper introduces a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation.<n>An open-source toolkit has be released to facilitate standardized benchmarking.
arXiv Detail & Related papers (2025-09-21T02:45:07Z) - Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce textbf sycophancy Mitigation through Adaptive Reasoning Trajectories.<n>We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z) - Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs.<n>We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason.<n>We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z) - Risk Analysis and Design Against Adversarial Actions [1.9573380763700716]
We propose a versatile, well-principled framework to evaluate the model's robustness against attacks of diverse types and intensities.<n>Results enable an assessment of the model vulnerability without requiring additional test data and operate in a distribution-free setup.
arXiv Detail & Related papers (2025-05-02T09:16:44Z) - Enhancing Relation Extraction via Supervised Rationale Verification and Feedback [12.687458877141934]
We propose a novel automated feedback framework for relation extraction.<n>It presents a rationale supervisor to verify the rationale and provides re-selected demonstrations as feedback to correct the initial prediction.<n>Our proposed framework significantly outperforms existing methods.
arXiv Detail & Related papers (2024-12-10T08:18:29Z) - Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the findings that $textitoutcome supervision for guided decoding essentially acts as a value model, we propose Outcome-supervised Value Model (OVM)
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
arXiv Detail & Related papers (2023-11-16T09:56:28Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.