Related papers: Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

URL: http://arxiv.org/abs/2510.06139v1
Date: Tue, 07 Oct 2025 17:14:10 GMT
Title: Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
Authors: Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, Jingdong Wang,
Abstract summary: FlowRVS is a novel framework that reconceptualizes RVOS as a conditional continuous flow problem.<n>We reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask.<n>Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks.
Score: 46.416906762916305
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g, point), and struggles to maintain temporal consistency as the segmenting process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models, fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventional generating from noise to mask or directly predicting mask, we reformulate the task by learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, achieving a $\mathcal{J}\&\mathcal{F}$ of 51.1 in MeViS (+1.6 over prior SOTA) and 73.3 in the zero shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.

Related papers

Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object (RVOS) aims to segment the object referred to by the query sentence in the video.<n>Most existing methods require end-to-end training with dense mask annotations.<n>We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder [5.57393627015653]
Video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models.<n>This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy.<n>We propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2.
arXiv Detail & Related papers (2025-06-28T13:30:36Z)
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations [33.74746234704817]
Video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description.<n>This is challenging as it involves deep vision-level understanding, pixel-level dense prediction andtemporal reasoning.<n>We propose bfReferDINO RVOS that inherits region-level vision-text alignment from foundational visual grounding models.
arXiv Detail & Related papers (2025-01-24T16:24:15Z)
SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling ( SIGMA) is a novel video pretraining method. We distribute features of space-time tubes evenly across a limited number of learnable clusters. Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model. Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred by a given language expression. Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented. This may hamper its application in real-world computing resource limited scenarios, such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.