Related papers: VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

URL: http://arxiv.org/abs/2511.16077v1
Date: Thu, 20 Nov 2025 06:12:25 GMT
Title: VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
Authors: Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li,
Abstract summary: VideoSeg-R1 is a framework that introduces reinforcement learning into video reasoning segmentation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem.
Score: 14.065667728414942
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text-guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.

Related papers

Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding [31.472828313904316]
Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query. We introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data.
arXiv Detail & Related papers (2025-08-10T15:38:04Z)
Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation [61.37076111486196]
Ref-AVS aims to segment target objects in audible videos based on given reference expressions. We propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process. Ref-Thinker is a multimodal language model capable of reasoning over textual, visual, and auditory cues.
arXiv Detail & Related papers (2025-08-06T13:05:09Z)
CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos [59.391265901911005]
We propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address complex challenges by temporal-semantic reasoning. CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Our framework's training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge.
arXiv Detail & Related papers (2025-05-24T07:01:31Z)
Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA). It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video. Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted. In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression. Existing works typically require the compressed video bitstream to be decoded to RGB frames before being segmented. This may hamper its application in real-world scenarios with limited computing resources, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, in all video frames, the object instances in a given video that are referred to by a language expression. This work proposes several tricks to boost performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation inference. The improved ReferFormer ranks 2nd in the CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z)
Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is the task of temporally localizing scene boundaries in a video. We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction. We achieve a new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is the task of segmenting the target object in a video sequence given only a mask in the first frame. Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning. We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification [18.13546384207381]
We propose a Reference-Aided Part-Aligned (RAPA) framework to disentangle robust features of different parts. By using both modules, the informative parts of pedestrians in videos are well aligned and a more discriminative feature representation is generated.
arXiv Detail & Related papers (2021-03-21T06:53:57Z)
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
ALBA : Reinforcement Learning for Video Object Segmentation [11.29255792513528]
We consider the challenging problem of zero-shot video object segmentation (VOS). We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks.
arXiv Detail & Related papers (2020-05-26T20:57:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.