Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
- URL: http://arxiv.org/abs/2508.13584v1
- Date: Tue, 19 Aug 2025 07:36:04 GMT
- Title: Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
- Authors: Ruixin Zhang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li
- Abstract summary: Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. We propose a Temporal-Conditional Referring Video Object Segmentation model, which integrates existing segmentation methods to enhance boundary segmentation capability.
- Score: 4.848917027477984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.
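The abstract describes a three-stage pipeline: encode clean frames into VAE latents, extract text-conditioned features with the diffusion backbone without ever injecting noise (so the noise-prediction module can be dropped), and refine the coarse per-frame masks with temporal context (the TCMR module). A minimal toy sketch of that data flow, assuming random projections as stand-ins for the learned modules (all names and shapes here are hypothetical, not the authors' code):

```python
# Toy sketch of the "noise-free" RVOS pipeline: text-conditioned feature
# extraction from clean latents (no noise injection, hence no
# noise-prediction step), then temporal-context mask refinement.
# Every weight below is a random stand-in for a learned module.
import numpy as np

rng = np.random.default_rng(0)

def extract_features(latents, text_emb, W_feat, W_text):
    """Project clean VAE latents and add a text condition (toy backbone)."""
    # latents: (T, H, W, C_lat) -> feats: (T, H, W, C_feat)
    feats = latents @ W_feat
    return feats + (text_emb @ W_text)  # broadcast text over all pixels

def refine_with_temporal_context(feats, W_mask, window=1):
    """Toy stand-in for TCMR: average features over neighboring frames,
    then score each pixel to produce a binary mask per frame."""
    T = feats.shape[0]
    refined = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        refined[t] = feats[lo:hi].mean(axis=0)   # temporal context
    logits = refined @ W_mask                    # (T, H, W, 1)
    return (logits[..., 0] > 0).astype(np.uint8)

T, H, W, C_lat, C_feat, C_txt = 4, 8, 8, 4, 16, 8
latents = rng.standard_normal((T, H, W, C_lat))  # clean VAE latents
text_emb = rng.standard_normal(C_txt)            # pooled text embedding
W_feat = rng.standard_normal((C_lat, C_feat))
W_text = rng.standard_normal((C_txt, C_feat))
W_mask = rng.standard_normal((C_feat, 1))

feats = extract_features(latents, text_emb, W_feat, W_text)
masks = refine_with_temporal_context(feats, W_mask)
print(masks.shape)  # (4, 8, 8): one binary mask per frame
```

The key point the sketch mirrors is structural: because the latents are never noised, there is nothing for a noise-prediction head to do, and the backbone's features feed the segmentation head directly.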
Related papers
- Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory [10.183518059286124]
Reasoning Video Object Segmentation (VOS) is a challenging task that requires stable object segmentation across video sequences. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demands substantial resources. We propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). Our method achieves excellent results on five benchmark datasets, including Ref-YouTube-VOS, Ref-DAVIS17, MeViS, and ReasonVOS.
arXiv Detail & Related papers (2026-03-02T07:15:41Z) - Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations. We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z) - Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z) - MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion [17.50161162624179]
Self-driving cars rely on reliable semantic environment perception for decision making.
We propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera.
Our model parses automotive scenes into multiple interpretable representations such as scene geometry, ego-motion, and object motion.
arXiv Detail & Related papers (2024-05-30T10:33:14Z) - Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [39.25927216187176]
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve single-image semantic segmentation.
The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model.
arXiv Detail & Related papers (2020-04-11T07:09:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.