VideoMaMa: Mask-Guided Video Matting via Generative Prior
- URL: http://arxiv.org/abs/2601.14255v1
- Date: Tue, 20 Jan 2026 18:59:56 GMT
- Title: VideoMaMa: Mask-Guided Video Matting via Generative Prior
- Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee
- Abstract summary: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. We present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes. We build a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video dataset.
- Score: 73.03369602195563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.
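The abstract outlines a three-stage recipe: a mask-to-matte model refines coarse masks into alpha mattes, that model then pseudo-labels large volumes of real footage, and a segmenter (SAM2) is fine-tuned on the result. The sketch below illustrates only the pseudo-labeling loop under stated assumptions; `segmenter` and `matting_model` are hypothetical stand-ins for an off-the-shelf video segmenter and VideoMaMa, not the authors' released interfaces.

```python
# A minimal sketch (not the authors' code) of the pseudo-labeling loop the
# abstract describes: coarse masks in, pixel-accurate alpha mattes out, saved
# as pseudo-labels for later fine-tuning.
from pathlib import Path

import numpy as np

def build_pseudo_labels(videos, segmenter, matting_model, out_dir="ma_v_labels"):
    """videos: iterable of (name, frames) pairs, frames shaped (T, H, W, 3)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, frames in videos:
        masks = segmenter(frames)              # coarse binary masks, (T, H, W)
        alphas = matting_model(frames, masks)  # refined mattes in [0, 1]
        np.savez(out / f"{name}.npz", alpha=alphas.astype(np.float32))

# Toy usage with stand-in callables: a threshold "segmenter" and an identity
# "matting model" (a real model would refine the mask boundaries).
frames = np.random.randint(0, 255, (8, 64, 64, 3), dtype=np.uint8)
build_pseudo_labels(
    [("demo", frames)],
    segmenter=lambda f: (f.mean(-1) > 127).astype(np.float32),
    matting_model=lambda f, m: m,
)
```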
Related papers
- IMAGEdit: Let Any Subject Transform [61.666509860041124]
IMAGEdit is a training-free framework for editing any number of video subjects. It manipulates the appearances of multiple designated subjects while preserving non-target regions, and it is compatible with any mask-driven video generation model.
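As a rough illustration of how mask-driven editing can preserve non-target regions, the following sketch composites edited frames back into the originals using per-frame subject masks. This is generic alpha compositing, not IMAGEdit's actual method; all names are hypothetical.

```python
# A generic mask-guided compositing sketch: edited content is kept only inside
# the subject masks, so non-target regions stay untouched.
import numpy as np

def composite_edit(original, edited, masks):
    """original, edited: (T, H, W, 3) floats; masks: (T, H, W) values in [0, 1]."""
    m = masks[..., None]                  # broadcast the mask over RGB channels
    return m * edited + (1.0 - m) * original

original = np.zeros((4, 32, 32, 3), dtype=np.float32)
edited = np.ones_like(original)
masks = np.zeros((4, 32, 32), dtype=np.float32)
masks[:, 8:24, 8:24] = 1.0                # hypothetical subject region
result = composite_edit(original, edited, masks)
assert result[:, 0, 0].sum() == 0.0       # background pixels are preserved
```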
arXiv Detail & Related papers (2025-10-01T17:59:56Z)
- Generative Video Matting [57.186684844156595]
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated, imperfect alpha and foreground annotations. We introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models.
arXiv Detail & Related papers (2025-08-11T12:18:55Z)
- VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion [9.465414294387507]
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance from a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models. We introduce a large-scale video referring matting dataset with 10,000 videos.
arXiv Detail & Related papers (2025-03-11T06:12:35Z)
- Data Collection-free Masked Video Modeling [6.641717260925999]
We introduce an effective self-supervised learning framework for videos that leverages readily available, less costly static images to synthesize pseudo-motion videos.
These pseudo-motion videos are then leveraged in masked video modeling.
Our approach is applicable to synthetic images as well, thus entirely freeing video training from data-collection costs and the other concerns that come with real data.
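A minimal sketch of one plausible way to fabricate pseudo-motion from a still image, assuming a simple sliding-crop camera path; the paper's exact transforms may differ, and the function and parameters below are illustrative only.

```python
# A sliding-crop "camera path" over a still image, one plausible form of
# pseudo-motion for masked video modeling.
import numpy as np

def pseudo_motion_clip(image, num_frames=16, crop=96, stride=2):
    """image: (H, W, 3) array; returns a (num_frames, crop, crop, 3) clip."""
    h, w, _ = image.shape
    frames = []
    for t in range(num_frames):
        x = min(t * stride, w - crop)      # pan right until the border
        y = min(t * stride, h - crop)      # and drift down diagonally
        frames.append(image[y:y + crop, x:x + crop])
    return np.stack(frames)

image = np.random.randint(0, 255, (128, 128, 3), dtype=np.uint8)
clip = pseudo_motion_clip(image)
print(clip.shape)  # (16, 96, 96, 3)
```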
arXiv Detail & Related papers (2024-09-10T17:34:07Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance.
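Dual masking, as named in the title, pairs a high-ratio encoder mask with an additional decoder mask, so reconstruction happens on only a subset of tokens rather than all of them. The token count and ratios below are illustrative assumptions, not the paper's settings.

```python
# Illustrative dual masking: the encoder sees only a small visible subset of
# tube tokens, and the decoder reconstructs just a further subset of the
# hidden ones, cutting compute at both ends.
import numpy as np

rng = np.random.default_rng(0)
num_tokens = 1568                        # e.g., 8 x 14 x 14 spatiotemporal tokens
perm = rng.permutation(num_tokens)

encoder_keep = int(num_tokens * 0.10)    # encoder processes ~10% of tokens
visible = perm[:encoder_keep]
hidden = perm[encoder_keep:]

decoder_ratio = 0.5                      # decoder reconstructs only half of those
targets = hidden[rng.random(hidden.size) < decoder_ratio]
print(visible.size, targets.size)        # both far below num_tokens
```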
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
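Masked visual-token modeling trains the model to predict discrete ids for masked video patches. The sketch below uses a nearest-neighbor codebook lookup as a stand-in for VIOLET's learned discrete tokenizer; the shapes, masking ratio, and random "model output" are assumptions for illustration.

```python
# Masked visual-token modeling in miniature: map patches to discrete ids with
# a codebook, mask some positions, and predict the ids at masked positions.
import torch
import torch.nn.functional as F

codebook = torch.randn(1024, 64)                        # 1024 codes, 64-dim each
patches = torch.randn(8, 196, 64)                       # (clips, patches, dim)
dists = torch.cdist(patches, codebook.expand(8, -1, -1))
token_ids = dists.argmin(-1)                            # (8, 196) target ids

mask = torch.rand(8, 196) < 0.15                        # mask ~15% of positions
logits = torch.randn(8, 196, 1024, requires_grad=True)  # stand-in model output
loss = F.cross_entropy(logits[mask], token_ids[mask])   # the MVM objective
loss.backward()
```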
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
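A hedged sketch of what an attention-based temporal aggregation module can look like: the current frame's features attend over a short window of neighboring-frame features, and the result is fused residually. The module name, dimensions, and window size are illustrative, not taken from the paper.

```python
# An illustrative attention-based temporal aggregation module for fusing
# per-frame matting features with their temporal neighbors.
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, current, window):
        """current: (B, N, C) frame features; window: (B, T*N, C) neighbors."""
        fused, _ = self.attn(query=current, key=window, value=window)
        return current + fused             # residual keeps per-frame detail

agg = TemporalAggregation()
current = torch.randn(2, 64, 256)          # an 8x8 feature map, flattened
window = torch.randn(2, 3 * 64, 256)       # a 3-frame temporal window
print(agg(current, window).shape)          # torch.Size([2, 64, 256])
```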
arXiv Detail & Related papers (2021-05-24T17:34:57Z)