VMFormer: End-to-End Video Matting with Transformer
- URL: http://arxiv.org/abs/2208.12801v1
- Date: Fri, 26 Aug 2022 17:51:02 GMT
- Title: VMFormer: End-to-End Video Matting with Transformer
- Authors: Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao
Wei and Humphrey Shi
- Abstract summary: Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
- Score: 48.97730965527976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video matting aims to predict the alpha mattes for each frame from a given
input video sequence. Recent solutions to video matting have been dominated by
deep convolutional neural networks (CNNs) for the past few years, which have
become the de-facto standard for both academia and industry. However, CNN-based
architectures have an inbuilt inductive bias toward locality and do not capture
the global characteristics of an image. They also lack long-range temporal
modeling, given the computational cost of handling feature maps across multiple
frames. In this paper, we propose VMFormer: a transformer-based end-to-end
method for video matting. Given an input video sequence, it predicts the alpha
matte of each frame from learnable queries. Specifically,
it leverages self-attention layers to build global integration of feature
sequences with short-range temporal modeling on successive frames. We further
apply the queries to learn global representations through cross-attention in
the transformer decoder, with long-range temporal modeling over all queries. In
the prediction stage, both the queries and the corresponding feature maps are
used to make the final prediction of the alpha mattes. Experiments show that
VMFormer outperforms previous CNN-based video matting methods on the composited
benchmarks. To the best of our knowledge, it is the first end-to-end video
matting solution built upon a full vision transformer that makes predictions on
learnable queries. The project
is open-sourced at https://chrisjuniorli.github.io/project/VMFormer/
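
To make the query-based pipeline described in the abstract more concrete, below is a minimal PyTorch sketch (not the authors' implementation): it assumes per-frame backbone features, a transformer encoder whose self-attention integrates the flattened feature sequence of a short frame window, a decoder in which learnable per-frame queries attend to those features via cross-attention, and a MaskFormer-style dot product between each frame's query and its feature map to produce the alpha matte. All module names, shapes, and the prediction rule are illustrative assumptions.

```python
# Minimal, illustrative PyTorch sketch of a query-based video matting head in
# the spirit of the abstract above. Module names, tensor shapes, and the
# dot-product prediction rule are assumptions, not the official VMFormer code.
import torch
import torch.nn as nn

class QueryVideoMattingSketch(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(feat_dim, num_heads, batch_first=True)
        # Self-attention over the flattened feature sequence of a short frame
        # window: global integration with short-range temporal modeling.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Learnable per-frame queries are refined by cross-attention to the
        # encoded features: long-range temporal modeling over all queries.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, feats, queries):
        """feats: (B, T, C, H, W) per-frame feature maps from a backbone.
        queries: (B, T, C) learnable query embeddings, one per frame."""
        B, T, C, H, W = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(B, T * H * W, C)
        memory = self.encoder(tokens)              # (B, T*H*W, C)
        q = self.decoder(queries, memory)          # (B, T, C)
        # Assumed prediction rule: dot product between each frame's query and
        # its own feature map, then a sigmoid, giving one alpha matte per frame.
        feat_maps = memory.reshape(B, T, H * W, C)
        alphas = torch.einsum('btc,btnc->btn', q, feat_maps).sigmoid()
        return alphas.reshape(B, T, H, W)

# Toy usage with random tensors (shapes are illustrative only).
model = QueryVideoMattingSketch()
feats = torch.randn(1, 4, 256, 16, 16)             # a 4-frame feature window
queries = torch.randn(1, 4, 256)                   # per-frame learnable queries
alpha_mattes = model(feats, queries)               # (1, 4, 16, 16)
```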
Related papers
- Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts [2.457872341625575]
We present Video Pre-trained Transformer.
It uses four SOTA encoder models to convert a video into a sequence of compact embeddings.
It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos.
arXiv Detail & Related papers (2023-03-24T17:18:40Z) - A unified model for continuous conditional video prediction [14.685237010856953]
Conditional video prediction tasks are normally solved by task-related models.
Almost all conditional video prediction models can only achieve discrete prediction.
In this paper, we propose a unified model that addresses these two issues at the same time.
arXiv Detail & Related papers (2022-10-11T22:26:59Z) - Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z) - Masked Conditional Video Diffusion for Prediction, Generation, and
Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with computation times measured in 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - On Pursuit of Designing Multi-modal Transformer for Video Grounding [35.25323276744999]
Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video.
We propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction.
All three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.
arXiv Detail & Related papers (2021-09-13T16:01:19Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attribute prediction problem, and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.