STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.11730v1
- Date: Thu, 12 Feb 2026 08:53:32 GMT
- Title: STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
- Authors: Xiaowen Zhang, Zhi Gao, Licheng Jiao, Lingling Li, Qing Li,
- Abstract summary: In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
- Score: 65.36458157092207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.
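The abstract describes two concrete mechanisms: embedding a unique, temporally consistent instance ID into each frame as a visual prompt, and a task-driven reward that jointly scores temporal accuracy, spatial consistency, and output format. The sketch below is an illustrative reconstruction under stated assumptions, not the authors' released code; the helper names (draw_id_prompts, temporal_iou, stvg_reward), the box/track format, and the reward weights are all hypothetical.

```python
import cv2  # pip install opencv-python
import numpy as np

def draw_id_prompts(frames, tracks):
    """Burn a temporally consistent instance ID into each tracked box.

    frames: list of HxWx3 uint8 arrays (video frames)
    tracks: dict {frame_idx: [(instance_id, (x1, y1, x2, y2)), ...]}
    Returns new frames with the IDs overlaid as explicit visual prompts.
    """
    prompted = []
    for t, frame in enumerate(frames):
        canvas = frame.copy()
        for inst_id, (x1, y1, x2, y2) in tracks.get(t, []):
            cv2.rectangle(canvas, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(canvas, f"ID:{inst_id}", (x1, max(y1 - 5, 12)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        prompted.append(canvas)
    return prompted

def temporal_iou(pred_span, gt_span):
    """1-D IoU between predicted and ground-truth (start, end) frame spans."""
    inter = max(0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def stvg_reward(pred, gt, w_t=0.5, w_s=0.4, w_f=0.1):
    """Composite task-driven reward: temporal accuracy + spatial (instance-ID)
    consistency + structural format regularization. Weights are illustrative.

    pred: {"span": (s, e), "id": int, "well_formed": bool}
    gt:   {"span": (s, e), "id": int}
    """
    r_temporal = temporal_iou(pred["span"], gt["span"])
    r_spatial = 1.0 if pred["id"] == gt["id"] else 0.0   # model picked the referred instance
    r_format = 1.0 if pred["well_formed"] else 0.0        # e.g. answer parses into the expected structure
    return w_t * r_temporal + w_s * r_spatial + w_f * r_format
```

Under this reformulation the VLM only has to emit an instance ID and a temporal span, so the reward can be computed without any coordinate regression head; a GRPO-style policy update could then use stvg_reward as the per-sample scalar return.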
Related papers
- BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion [7.382475458362566]
We present BREATH-VL, a hybrid framework that integrates semantic cues from vision-language models with geometric information from registration methods for accurate 6-DoF pose estimation. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline.
arXiv Detail & Related papers (2026-01-07T09:00:52Z) - VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation [18.15310805625469]
VOST-SGG is a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models. We propose a multi-modal feature bank that fuses visual, textual, and spatial cues for improved predicate classification. Our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG.
arXiv Detail & Related papers (2025-12-05T08:34:06Z) - Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning [41.30900315121155]
Multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained alignment in standard visual encoders. We propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without architectural modifications.
arXiv Detail & Related papers (2025-11-26T13:21:15Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization [57.484274282231226]
We propose SVQA-R1, the first framework to extend R1-style training to spatial VQA. In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects. Our model, SVQA-R1, not only dramatically improves accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning data.
arXiv Detail & Related papers (2025-06-02T06:58:43Z) - Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves the existing 13 strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z) - Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT).
To generate the above template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling.
Our method outperforms previous state-of-the-arts with clear margins on two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)