MatAnyone: Stable Video Matting with Consistent Memory Propagation
- URL: http://arxiv.org/abs/2501.14677v1
- Date: Fri, 24 Jan 2025 17:56:24 GMT
- Title: MatAnyone: Stable Video Matting with Consistent Memory Propagation
- Authors: Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy
- Abstract summary: MatAnyone is a robust framework tailored for target-assigned video matting.
We introduce a consistent memory propagation module via region-adaptive memory fusion.
For robust training, we present a larger, high-quality, and diverse dataset for video matting.
- Score: 55.93983057352684
- License:
- Abstract: Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
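The region-adaptive memory fusion described in the abstract can be pictured as a learned, per-pixel blend between the memory carried over from the previous frame and the value freshly read for the current frame. The sketch below (PyTorch) only illustrates that idea; the module name, gate design, and tensor shapes are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of region-adaptive memory fusion (illustrative, not the paper's code).
import torch
import torch.nn as nn

class RegionAdaptiveMemoryFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Lightweight gate: looks at both readouts and predicts a per-pixel weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, prev_mem: torch.Tensor, curr_read: torch.Tensor) -> torch.Tensor:
        # prev_mem:  (B, C, H, W) value propagated from the previous frame
        # curr_read: (B, C, H, W) value queried from memory for the current frame
        w = self.gate(torch.cat([prev_mem, curr_read], dim=1))  # (B, 1, H, W)
        # Core regions (w -> 0) keep the stable previous-frame memory;
        # boundary regions (w -> 1) favor the detail-rich current readout.
        return (1.0 - w) * prev_mem + w * curr_read

# Usage sketch:
# fusion = RegionAdaptiveMemoryFusion(channels=256)
# fused = fusion(prev_mem, curr_read)  # then fed to the decoder that predicts the alpha matte
```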
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Adaptive Human Matting for Dynamic Videos [62.026375402656754]
Adaptive Matting for Dynamic Videos, termed AdaM, is a framework for differentiating foregrounds from backgrounds.
Two interconnected network designs are employed to achieve this goal.
We benchmark and study our methods on recently introduced datasets, showing that our matting achieves new best-in-class generalizability.
arXiv Detail & Related papers (2023-04-12T17:55:59Z)
- Multi-Scale Memory-Based Video Deblurring [34.488707652997704]
We design a memory branch to memorize the blurry-sharp feature pairs in the memory bank.
To enrich the memory of our memory bank, we also design a bidirectional recurrency and multi-scale strategy.
Experimental results demonstrate that our model outperforms other state-of-the-art methods.
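Read literally, the memory branch above amounts to an attention lookup in which blurry features act as keys and their paired sharp features as values. The sketch below illustrates that reading under assumed function and tensor names; it is not the paper's actual interface.

```python
# Rough sketch of querying a memory bank of blurry-sharp feature pairs (illustrative only).
import torch

def query_memory_bank(curr_blurry, bank_blurry_keys, bank_sharp_values):
    """
    curr_blurry:       (B, C, H, W)  features of the current blurry frame
    bank_blurry_keys:  (B, N, C)     stored blurry features (flattened)
    bank_sharp_values: (B, N, C)     their paired sharp features
    returns:           (B, C, H, W)  sharp features retrieved for the current frame
    """
    B, C, H, W = curr_blurry.shape
    q = curr_blurry.flatten(2).transpose(1, 2)          # (B, H*W, C)
    attn = torch.softmax(
        (q @ bank_blurry_keys.transpose(1, 2)) / C ** 0.5,  # (B, H*W, N)
        dim=-1,
    )
    out = attn @ bank_sharp_values                      # (B, H*W, C)
    return out.transpose(1, 2).reshape(B, C, H, W)
```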
arXiv Detail & Related papers (2022-04-06T08:48:56Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
- Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval [155.32369959647437]
Cross-modal video-text retrieval is a challenging task in the field of vision and language.
Existing approaches for this task focus on designing the encoding model, typically trained with a hard negative ranking loss.
We propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval.
arXiv Detail & Related papers (2021-03-29T15:15:09Z)
- Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attribute prediction problem, with the goal of predicting these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)