Adaptive Human Matting for Dynamic Videos
- URL: http://arxiv.org/abs/2304.06018v1
- Date: Wed, 12 Apr 2023 17:55:59 GMT
- Title: Adaptive Human Matting for Dynamic Videos
- Authors: Chung-Ching Lin, Jiang Wang, Kun Luo, Kevin Lin, Linjie Li, Lijuan
Wang, Zicheng Liu
- Abstract summary: Adaptive Matting for Dynamic Videos, termed AdaM, is a framework for simultaneously differentiating foregrounds from backgrounds and capturing alpha matte details of human subjects in the foreground.
Two interconnected network designs are employed to achieve this goal.
We benchmark and study our methods on recently introduced datasets, showing that our model achieves new best-in-class generalizability.
- Score: 62.026375402656754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The most recent efforts in video matting have focused on eliminating trimap
dependency since trimap annotations are expensive and trimap-based methods are
less adaptable for real-time applications. Despite the latest trimap-free
methods showing promising results, their performance often degrades when
dealing with highly diverse and unstructured videos. We address this limitation
by introducing Adaptive Matting for Dynamic Videos, termed AdaM, which is a
framework designed for simultaneously differentiating foregrounds from
backgrounds and capturing alpha matte details of human subjects in the
foreground. Two interconnected network designs are employed to achieve this
goal: (1) an encoder-decoder network that produces alpha mattes and
intermediate masks which are used to guide the transformer in adaptively
decoding foregrounds and backgrounds, and (2) a transformer network in which
long- and short-term attention combine to retain spatial and temporal contexts,
facilitating the decoding of foreground details. We benchmark and study our
methods on recently introduced datasets, showing that our model notably
improves matting realism and temporal coherence in complex real-world videos
and achieves new best-in-class generalizability. Further details and examples
are available at https://github.com/microsoft/AdaM.
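As a rough illustration of the two interconnected networks described in the abstract, the PyTorch sketch below shows one plausible way to fuse long- and short-term attention and to let an intermediate foreground mask guide alpha prediction. All module names, tensor sizes, and wiring here are illustrative assumptions, not the authors' implementation (see the GitHub link above for the actual code).

```python
# Illustrative sketch only -- not the AdaM implementation. Module names, feature
# sizes, and the mask-guidance wiring are assumptions made for clarity.
import torch
import torch.nn as nn


class LongShortTermAttention(nn.Module):
    """Fuse attention over a long-term memory bank (many past frames)
    with attention over short-term context (the most recent frame)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.long_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.short_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query, long_mem, short_mem):
        # query:     (B, N, C) tokens of the current frame
        # long_mem:  (B, M, C) tokens pooled from many past frames
        # short_mem: (B, K, C) tokens from the previous frame
        long_ctx, _ = self.long_attn(query, long_mem, long_mem)
        short_ctx, _ = self.short_attn(query, short_mem, short_mem)
        return self.fuse(torch.cat([long_ctx, short_ctx], dim=-1))


class MaskGuidedMattingHead(nn.Module):
    """Decoder head that predicts an intermediate foreground mask and then
    an alpha matte conditioned on that mask."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)
        self.alpha_head = nn.Sequential(
            nn.Conv2d(dim + 1, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, 1, kernel_size=1),
        )

    def forward(self, feat):
        # feat: (B, C, H, W) fused spatio-temporal features
        mask = torch.sigmoid(self.mask_head(feat))             # coarse fg/bg mask
        alpha = torch.sigmoid(self.alpha_head(torch.cat([feat, mask], dim=1)))
        return mask, alpha


if __name__ == "__main__":
    B, C, H, W = 1, 256, 32, 32
    frame_feat = torch.randn(B, C, H, W)
    tokens = frame_feat.flatten(2).transpose(1, 2)             # (B, H*W, C)
    fused = LongShortTermAttention(C)(
        tokens, torch.randn(B, 512, C), torch.randn(B, 128, C)
    )
    fused = fused.transpose(1, 2).reshape(B, C, H, W)
    mask, alpha = MaskGuidedMattingHead(C)(fused)
    print(mask.shape, alpha.shape)                             # both (1, 1, 32, 32)
```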
Related papers
- SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters (a generic sketch of this balancing step appears after this list).
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
- Deep Video Matting via Spatio-Temporal Alignment and Aggregation [63.6870051909004]
We propose a deep learning-based video matting framework which employs a novel spatio-temporal feature aggregation module (ST-FAM).
To eliminate frame-by-frame trimap annotations, a lightweight interactive trimap propagation network is also introduced.
Our framework significantly outperforms conventional video matting and deep image matting methods.
arXiv Detail & Related papers (2021-04-22T17:42:08Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
arXiv Detail & Related papers (2020-11-23T15:40:15Z)
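The SIGMA entry above mentions distributing space-time tube features evenly across a limited number of learnable clusters; a common way to obtain such a balanced soft assignment is a Sinkhorn-Knopp projection of the feature-to-prototype similarity matrix. The sketch below is a generic, self-contained version of that step; the cluster count, feature size, temperature, and iteration count are illustrative assumptions, not values taken from the paper.

```python
# Generic Sinkhorn-Knopp balancing step (illustrative; not SIGMA's actual code).
import torch
import torch.nn.functional as F


def sinkhorn(scores: torch.Tensor, eps: float = 0.05, iters: int = 3) -> torch.Tensor:
    """Turn a (num_tubes, num_clusters) similarity matrix into a soft assignment
    whose clusters receive roughly equal total mass."""
    q = torch.exp(scores / eps)
    q = q / q.sum()                              # joint distribution over (tube, cluster)
    n, k = q.shape
    for _ in range(iters):
        q = q / q.sum(dim=0, keepdim=True) / k   # balance cluster marginals
        q = q / q.sum(dim=1, keepdim=True) / n   # balance tube marginals
    return q / q.sum(dim=1, keepdim=True)        # rows sum to 1: per-tube assignment


if __name__ == "__main__":
    tube_feats = F.normalize(torch.randn(64, 128), dim=1)   # 64 space-time tube features
    prototypes = F.normalize(torch.randn(16, 128), dim=1)   # 16 learnable clusters
    assign = sinkhorn(tube_feats @ prototypes.t())
    print(assign.shape, assign.sum(dim=0))                  # each cluster gets ~64/16 = 4
```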
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.