Reinforcing Video Reasoning Segmentation to Think Before It Segments
- URL: http://arxiv.org/abs/2508.11538v1
- Date: Fri, 15 Aug 2025 15:34:56 GMT
- Title: Reinforcing Video Reasoning Segmentation to Think Before It Segments
- Authors: Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu,
- Abstract summary: We introduce Veason-R1, a specialized LVLM for video reasoning segmentation.<n>Veason-R1 is trained through Group Relative Policy Optimization (O) augmented with Chain-of-Thought trajectories.<n>We incorporate a holistic reward mechanism that enhances spatial alignment and temporal consistency.<n>Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins.
- Score: 67.5703457389657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into <SEG> tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J &F in ReVOS and +10.0 J &F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.
Related papers
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision.<n>We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning.<n>We also introduce textbfDU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing [8.731693840957716]
Think2Seg-RS is a framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts.<n>The framework achieves state-of-the-art performance on the EarthReason dataset.<n> compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds.
arXiv Detail & Related papers (2025-12-22T11:46:42Z) - Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization.<n>We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains.<n>Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
arXiv Detail & Related papers (2025-12-19T12:56:17Z) - ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning [44.49803237328707]
ReVSeg executes reasoning as sequential decisions in the native interface of pretrained vision language models.<n>We employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals.
arXiv Detail & Related papers (2025-12-02T14:44:12Z) - VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning [33.42243283912315]
Chain-of-Thought prompting has proven remarkably effective for eliciting complex reasoning in large language models.<n>Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning.<n>We introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning.
arXiv Detail & Related papers (2025-11-21T19:30:24Z) - Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization [63.169050703903515]
We propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL)<n>Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data.<n>Experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%.
arXiv Detail & Related papers (2025-09-26T04:55:00Z) - TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding [28.79516973256083]
Temporal Video Grounding aims to precisely localize video segments corresponding to natural language queries.<n>We propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG)<n>TAR-TVG introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content.
arXiv Detail & Related papers (2025-08-11T06:59:32Z) - AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation [113.75682363364004]
AURORA is a framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation.<n>AURORA achieves state-of-the-art performance on Ref-AVS benchmarks and generalizes effectively to unreferenced segmentation.
arXiv Detail & Related papers (2025-08-04T07:47:38Z) - Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs [12.883053399582174]
Current vision-language models struggle with fine-grained spatial reasoning.<n>We introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations.<n>We show that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks.
arXiv Detail & Related papers (2025-06-26T18:00:00Z) - Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective.<n> Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z) - SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization [57.484274282231226]
We propose SVQA-R1, the first framework to extend R1-style training to spatial VQA.<n>In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects.<n>Our model, SVQA-R1, not only dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning data.
arXiv Detail & Related papers (2025-06-02T06:58:43Z) - DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models [18.06361678575107]
We propose textbfDINO-R1, the first attempt to incentivize visual in-context reasoning capabilities of vision foundation models.<n>DINO-R1 introduces textbfGroup Relative Query Optimization (GRQO), a novel reinforcement-style training strategy.<n>Experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines.
arXiv Detail & Related papers (2025-05-29T21:58:06Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs)<n>We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset.<n>We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.<n>This is challenging as it involves deep vision-level understanding, pixel-level dense prediction andtemporal reasoning.<n>We propose bfReferDINO RVOS that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.