S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
- URL: http://arxiv.org/abs/2512.14440v1
- Date: Tue, 16 Dec 2025 14:26:30 GMT
- Title: S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
- Authors: Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
- Abstract summary: We propose an unsupervised video instance segmentation model trained exclusively on real video data. We establish temporal coherence by leveraging deep motion priors to identify high-quality keymasks in the video. Our approach outperforms the current state-of-the-art across various benchmarks.
- Score: 27.42479195861311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the state-of-the-art in unsupervised video instance segmentation has relied heavily on synthetic video data generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement of parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise, and their quality varies through the video. Therefore, we establish temporal coherence by leveraging deep motion priors to identify high-quality keymasks in the video. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state-of-the-art across various benchmarks.
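The abstract does not pin down the loss; a minimal sketch of how sparse keymask pseudo-labels could gate a dense per-frame distillation loss. The quality scores, threshold `tau`, and all names are illustrative assumptions, not the paper's actual Temporal DropLoss implementation:

```python
# Hedged sketch: gating a dense distillation loss with sparse keymask
# quality scores. All names here are illustrative assumptions, not the
# paper's actual Temporal DropLoss implementation.
import torch
import torch.nn.functional as F

def temporal_drop_loss(student_logits, pseudo_masks, quality, tau=0.5):
    """Per-frame cross-entropy against pseudo-masks; frames whose keymask
    quality falls below `tau` are dropped from the loss so that noisy
    single-frame segmentations do not dominate training.

    student_logits: (T, C, H, W) per-frame instance logits
    pseudo_masks:   (T, H, W)    integer pseudo-instance labels
    quality:        (T,)         keymask quality scores in [0, 1]
    """
    keep = quality >= tau                  # frames with trusted keymasks
    if not keep.any():
        return student_logits.sum() * 0.0  # no supervision for this clip
    return F.cross_entropy(student_logits[keep], pseudo_masks[keep])

# Toy usage: 4 frames, 3 instance channels, 32x32 resolution.
logits = torch.randn(4, 3, 32, 32, requires_grad=True)
masks = torch.randint(0, 3, (4, 32, 32))
quality = torch.tensor([0.9, 0.2, 0.8, 0.4])  # frames 0 and 2 supervise
temporal_drop_loss(logits, masks, quality).backward()
```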
Related papers
- AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities. We incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation.
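The summary does not specify the encoding; as one plausible reading, a speculative sketch that encodes a temporal interval by concatenating sinusoidal encodings of its endpoints (not the paper's mechanism):

```python
# Speculative sketch: encode a temporal interval [t_start, t_end] by
# concatenating sinusoidal encodings of its endpoints. This is one
# plausible construction, not the mechanism described in the paper.
import torch

def sinusoid(t: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Standard sinusoidal encoding of a timestamp tensor."""
    freqs = torch.exp(
        -torch.arange(0, dim, 2).float() / dim * torch.log(torch.tensor(1e4))
    )
    ang = t.float()[..., None] * freqs            # (..., dim // 2)
    return torch.cat((ang.sin(), ang.cos()), -1)  # (..., dim)

def interval_encoding(t_start: torch.Tensor, t_end: torch.Tensor) -> torch.Tensor:
    return torch.cat((sinusoid(t_start), sinusoid(t_end)), -1)

# e.g. a subject present from frame 8 to frame 24 -> a (1, 128) code
code = interval_encoding(torch.tensor([8.0]), torch.tensor([24.0]))
```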
arXiv Detail & Related papers (2025-12-11T18:59:34Z)
- DirectSwap: Mask-Free Cross-Identity Training and Benchmarking for Expression-Consistent Video Head Swapping [58.2549561389375]
Video head swapping aims to replace the entire head of a video subject, including facial identity, head shape, and hairstyle, with that of a reference image. Due to the lack of ground-truth paired swapping data, prior methods typically train on cross-frame pairs of the same person within a video. We propose DirectSwap, a mask-free, direct video head-swapping framework that extends an image U-Net into a video diffusion model.
arXiv Detail & Related papers (2025-12-10T08:31:28Z)
- FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching [19.401125268811015]
FlowCut is a three-stage method for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across frames. In the third stage, we use the YouTubeVIS-2021 video dataset to extract our training instance segmentation set and then train a video segmentation model.
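A minimal sketch of the second stage's temporal matching, assuming masks are linked greedily by IoU; the criterion and threshold are assumptions about the paper's matching rule:

```python
# Illustrative sketch of stage-two temporal matching: greedily link each
# pseudo-mask at frame t to its best-IoU mask at frame t+1. The IoU
# criterion and threshold are assumptions, not FlowCut's exact rule.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def match_across_frames(masks_t, masks_t1, thresh=0.5):
    """Return (i, j) links between frames; unmatched masks end a segment."""
    pairs, used = [], set()
    for i, m in enumerate(masks_t):
        ious = [mask_iou(m, n) if j not in used else -1.0
                for j, n in enumerate(masks_t1)]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thresh:
            pairs.append((i, j))
            used.add(j)
    return pairs
```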
arXiv Detail & Related papers (2025-05-19T14:30:33Z)
- Lester: rotoscope animation through video object segmentation and tracking [0.0]
Lester is a novel method to automatically synthesize retro-style 2D animations from videos.
Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT.
Results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances.
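A pipeline sketch under stated assumptions: SAM usage follows the public segment-anything API, while the checkpoint path and the tracker interface are hypothetical placeholders (the real DeAOT API differs):

```python
# Pipeline sketch: segment the first frame with SAM, then propagate masks
# with a tracker. SAM calls follow the segment-anything repo; the
# checkpoint path and the tracker interface are placeholders.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)

def rotoscope(frames, tracker):
    """frames: list of HxWx3 uint8 images; tracker: hypothetical interface."""
    first = mask_generator.generate(frames[0])     # list of mask dicts
    masks = [m["segmentation"] for m in first]     # boolean HxW arrays
    tracker.initialize(frames[0], masks)           # hypothetical API
    return [masks] + [tracker.propagate(f) for f in frames[1:]]
```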
arXiv Detail & Related papers (2024-02-15T11:15:54Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
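A hedged sketch of the sequence-level selection mechanism, assuming flow-predicted masks are scored by agreement with a neighbouring mask warped into the same frame and the top-scoring frames are kept as exemplars (the scoring rule is an assumption):

```python
# Hedged sketch of sequence-level exemplar selection: score each
# flow-predicted mask by agreement with its neighbour's warped mask and
# keep the top-k frames. The consistency score is an assumption.
import numpy as np

def consistency(mask_t: np.ndarray, warped_t1: np.ndarray) -> float:
    union = np.logical_or(mask_t, warped_t1).sum()
    return np.logical_and(mask_t, warped_t1).sum() / union if union else 0.0

def select_exemplars(masks, warped_neighbours, k=5):
    """Return indices of the k most temporally consistent masks."""
    scores = [consistency(m, w) for m, w in zip(masks, warped_neighbours)]
    return [int(i) for i in np.argsort(scores)[::-1][:k]]
```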
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - TubeFormer-DeepLab: Video Mask Transformer [98.47947102154217]
We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
TubeFormer-DeepLab directly predicts video tubes with task-specific labels.
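In mask-transformer designs, a "tube" is one mask per frame emitted by a single object query; a minimal shape-level sketch (dimensions and heads are illustrative, not TubeFormer-DeepLab's actual architecture):

```python
# Shape-level sketch of tube prediction in the mask-transformer style:
# each object query emits one class label and one mask per frame ("tube").
# Dimensions are illustrative, not TubeFormer-DeepLab's actual ones.
import torch

queries = torch.randn(10, 256)           # Q = 10 object queries
features = torch.randn(4, 256, 32, 32)   # T = 4 frames of pixel features
class_logits = queries @ torch.randn(256, 21)                  # (Q, classes)
tube_masks = torch.einsum("qc,tchw->qthw", queries, features)  # (Q, T, H, W)
```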
arXiv Detail & Related papers (2022-05-30T18:10:33Z) - Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
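A minimal sketch of one way to "anticipate motion": penalise predicted soft masks for failing to partition the frame into regions of roughly constant optical flow (the paper's exact formulation may differ):

```python
# Sketch: predicted soft masks should split the frame into regions of
# roughly constant optical flow. Flow is reconstructed from per-region
# means; the paper's formulation may differ from this.
import torch

def motion_loss(soft_masks, flow):
    """soft_masks: (K, H, W) softmax over K regions; flow: (2, H, W)."""
    w = soft_masks / (soft_masks.sum(dim=(1, 2), keepdim=True) + 1e-6)
    mean_flow = torch.einsum("khw,chw->kc", w, flow)   # per-region mean
    recon = torch.einsum("khw,kc->chw", soft_masks, mean_flow)
    return ((recon - flow) ** 2).mean()

# Toy usage: 4 regions over a 64x64 frame.
masks = torch.softmax(torch.randn(4, 64, 64), dim=0)
loss = motion_loss(masks, torch.randn(2, 64, 64))
```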
arXiv Detail & Related papers (2022-05-16T17:55:34Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets and runs faster than other contemporary state-of-the-art methods.
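A hedged sketch of the bottom-up grouping idea: cluster per-pixel embeddings into instances (plain k-means below stands in for the paper's tag-based attention):

```python
# Hedged sketch of bottom-up grouping: cluster per-pixel embeddings into
# instances. Plain k-means stands in for the paper's tag-based attention.
import numpy as np

def group_pixels(tags: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    """tags: (H*W, D) per-pixel embeddings -> (H*W,) instance ids."""
    centers = tags[np.random.choice(len(tags), k, replace=False)]
    for _ in range(iters):
        ids = np.argmin(((tags[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.stack([tags[ids == i].mean(0) if (ids == i).any()
                            else centers[i] for i in range(k)])
    return ids
```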
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Spatial Feature Calibration and Temporal Fusion for Effective One-stage
Video Instance Segmentation [16.692219644392253]
We propose STMask, a one-stage video instance segmentation framework built on spatial feature calibration and temporal fusion.
Experiments on the YouTube-VIS validation set show that STMask with a ResNet-50/-101 backbone obtains 33.5% / 36.8% mask AP while running at 28.6 / 23.4 FPS.
arXiv Detail & Related papers (2021-04-06T09:26:58Z) - Weakly Supervised Instance Segmentation for Videos with Temporal Mask
Consistency [28.352140544936198]
Weakly supervised instance segmentation reduces the cost of annotations required to train models.
We show that these issues can be better addressed by training with weakly labeled videos instead of images.
We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation.
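A minimal sketch of a temporal mask-consistency signal, assuming the mask predicted at frame t+1 is warped back to frame t with optical flow and compared to the frame-t prediction (the loss form and flow convention are assumptions):

```python
# Sketch of a temporal mask-consistency signal: warp the mask predicted at
# frame t+1 back to frame t with optical flow and penalise disagreement.
# The grid construction is standard; the loss and flow convention
# (channel 0 = x displacement) are assumptions.
import torch
import torch.nn.functional as F

def warp(mask, flow):
    """mask: (1, 1, H, W) soft mask at t+1; flow: (1, 2, H, W), t -> t+1."""
    _, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float() + flow[0].permute(1, 2, 0)
    gx = 2 * grid[..., 0] / (w - 1) - 1   # normalise to [-1, 1] for grid_sample
    gy = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(mask, torch.stack((gx, gy), -1).unsqueeze(0),
                         align_corners=True)

def consistency_loss(mask_t, mask_t1, flow_t_to_t1):
    return F.l1_loss(mask_t, warp(mask_t1, flow_t_to_t1))
```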
arXiv Detail & Related papers (2021-03-23T23:20:46Z)