MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
- URL: http://arxiv.org/abs/2512.11782v1
- Date: Fri, 12 Dec 2025 18:51:49 GMT
- Title: MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
- Authors: Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao,
- Abstract summary: We introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. MQE scales up video matting in two ways: as an online matting-quality feedback during training to suppress erroneous regions, and as an offline selection module for data curation. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
- Score: 20.570147774716343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
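The abstract describes using the MQE's pixel-wise evaluation map as online feedback that suppresses erroneous regions during training. A minimal sketch of that idea is a quality-weighted loss: pixels the evaluator marks unreliable contribute little or nothing to the supervision signal. The function name and the simple weighted-L1 form below are illustrative assumptions, not the paper's actual loss; the paper's MQE is a learned network, stood in for here by a given per-pixel reliability map.

```python
import numpy as np

def mqe_weighted_l1(pred, pseudo, quality, eps=1e-8):
    """Quality-weighted L1 loss (illustrative, not the paper's exact loss).

    quality: per-pixel reliability map in [0, 1] (1 = trusted supervision),
    standing in for the output of a learned Matting Quality Evaluator.
    Unreliable regions of the pseudo-label are down-weighted toward zero.
    """
    return float((quality * np.abs(pred - pseudo)).sum() / (quality.sum() + eps))

# Toy example: the pseudo-label is wrong in the right half of the frame,
# and the quality map marks that half as unreliable (weight 0).
pred = np.zeros((4, 4))                                            # model output
pseudo = np.concatenate([np.zeros((4, 2)), np.ones((4, 2))], 1)    # noisy label
quality = np.concatenate([np.ones((4, 2)), np.zeros((4, 2))], 1)   # MQE map

print(mqe_weighted_l1(pred, pseudo, quality))           # 0.0: bad region suppressed
print(mqe_weighted_l1(pred, pseudo, np.ones((4, 4))))   # ~0.5: plain L1 penalizes it
```

The same scalar (averaged over a whole matte) could also serve the offline role the abstract mentions: scoring candidate mattes from several models and keeping the best one per clip during data curation.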
Related papers
- VideoMaMa: Mask-Guided Video Matting via Generative Prior [73.03369602195563]
Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. We present Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes. We build a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video dataset.
arXiv Detail & Related papers (2026-01-20T18:59:56Z) - CAMP-VQA: Caption-Embedded Multimodal Perception for No-Reference Quality Assessment of Compressed Video [9.172799792564009]
We propose CAMP-VQA, a novel NR-VQA framework that exploits the semantic understanding capabilities of large models. Our approach introduces a quality-aware video metadata mechanism that integrates key fragments extracted from inter-frame variations. Our model consistently outperforms existing NR-VQA methods, achieving improved accuracy without the need for costly manual fine-grained annotations.
arXiv Detail & Related papers (2025-11-10T16:37:47Z) - Generative Video Matting [57.186684844156595]
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations. We introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models.
arXiv Detail & Related papers (2025-08-11T12:18:55Z) - Bridging Video Quality Scoring and Justification via Large Multimodal Models [14.166920184033463]
Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. But a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue.
arXiv Detail & Related papers (2025-06-26T05:02:25Z) - Towards Generalized Video Quality Assessment: A Weak-to-Strong Learning Paradigm [76.63001244080313]
Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception. The dominant VQA paradigm relies on supervised training with human-labeled datasets. We explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on large-scale human-labeled datasets.
arXiv Detail & Related papers (2025-05-06T15:29:32Z) - MatAnyone: Stable Video Matting with Consistent Memory Propagation [55.93983057352684]
MatAnyone is a robust framework tailored for target-assigned video matting. We introduce a consistent memory propagation module via region-adaptive memory fusion. For robust training, we present a larger, high-quality, and diverse dataset for video matting.
arXiv Detail & Related papers (2025-01-24T17:56:24Z) - Adaptive Human Matting for Dynamic Videos [62.026375402656754]
Adaptive Matting for Dynamic Videos (AdaM) is a framework for simultaneously differentiating foregrounds from backgrounds in dynamic videos.
Two interconnected network designs are employed to achieve this goal.
We benchmark and study our methods on recently introduced datasets, showing that our matting achieves new best-in-class generalizability.
arXiv Detail & Related papers (2023-04-12T17:55:59Z) - Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z) - The DEVIL is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting [43.90848669491335]
We propose the Diagnostic Evaluation of Video Inpainting on Landscapes (DEVIL) benchmark, which consists of two contributions.
Our challenging benchmark enables more insightful analysis into video inpainting methods and serves as an invaluable diagnostic tool for the field.
arXiv Detail & Related papers (2021-05-11T20:13:53Z) - Deep Video Matting via Spatio-Temporal Alignment and Aggregation [63.6870051909004]
We propose a deep learning-based video matting framework which employs a novel spatio-temporal feature aggregation module (STFAM).
To eliminate frame-by-frame trimap annotations, a lightweight interactive trimap propagation network is also introduced.
Our framework significantly outperforms conventional video matting and deep image matting methods.
arXiv Detail & Related papers (2021-04-22T17:42:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.