VMFormer: End-to-End Video Matting with Transformer
- URL: http://arxiv.org/abs/2208.12801v1
- Date: Fri, 26 Aug 2022 17:51:02 GMT
- Title: VMFormer: End-to-End Video Matting with Transformer
- Authors: Jiachen Li, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Yunchao
Wei and Humphrey Shi
- Abstract summary: Video matting aims to predict alpha mattes for each frame from a given input video sequence.
Recent solutions to video matting have been dominated by deep convolutional neural networks (CNNs).
We propose VMFormer: a transformer-based end-to-end method for video matting.
- Score: 48.97730965527976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video matting aims to predict the alpha mattes for each frame from a given
input video sequence. Recent solutions to video matting have been dominated by
deep convolutional neural networks (CNNs) for the past few years, which have
become the de-facto standard for both academia and industry. However, CNN-based
architectures have an inbuilt inductive bias toward locality and do not capture
the global characteristics of an image. They also lack long-range temporal
modeling, given the computational cost of handling feature maps across multiple
frames. In this paper, we propose VMFormer: a transformer-based end-to-end
method for video matting. Given an input video sequence, it predicts the alpha
matte of each frame from learnable queries. Specifically,
it leverages self-attention layers to build global integration of feature
sequences with short-range temporal modeling on successive frames. We further
apply the queries to learn global representations through cross-attention in
the transformer decoder, with long-range temporal modeling over all queries. In
the prediction stage, both the queries and the corresponding feature maps are
used to make the final prediction of the alpha mattes. Experiments show that
VMFormer outperforms previous CNN-based video matting methods on the composited
benchmarks. To the best of our knowledge, it is the first end-to-end video
matting solution built upon a full vision transformer that makes predictions on
learnable queries. The project
is open-sourced at https://chrisjuniorli.github.io/project/VMFormer/
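
To make the query-based pipeline described in the abstract more concrete, below is a minimal PyTorch sketch (not the authors' implementation): it assumes per-frame backbone features, a transformer encoder whose self-attention integrates the flattened feature sequence of a short frame window, a decoder in which learnable per-frame queries attend to those features via cross-attention, and a MaskFormer-style dot product between each frame's query and its feature map to produce the alpha matte. All module names, shapes, and the prediction rule are illustrative assumptions.

```python
# Minimal, illustrative PyTorch sketch of a query-based video matting head in
# the spirit of the abstract above. Module names, tensor shapes, and the
# dot-product prediction rule are assumptions, not the official VMFormer code.
import torch
import torch.nn as nn

class QueryVideoMattingSketch(nn.Module):
    def __init__(self, feat_dim=256, num_heads=8, num_layers=3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(feat_dim, num_heads, batch_first=True)
        # Self-attention over the flattened feature sequence of a short frame
        # window: global integration with short-range temporal modeling.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Learnable per-frame queries are refined by cross-attention to the
        # encoded features: long-range temporal modeling over all queries.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, feats, queries):
        """feats: (B, T, C, H, W) per-frame feature maps from a backbone.
        queries: (B, T, C) learnable query embeddings, one per frame."""
        B, T, C, H, W = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(B, T * H * W, C)
        memory = self.encoder(tokens)              # (B, T*H*W, C)
        q = self.decoder(queries, memory)          # (B, T, C)
        # Assumed prediction rule: dot product between each frame's query and
        # its own feature map, then a sigmoid, giving one alpha matte per frame.
        feat_maps = memory.reshape(B, T, H * W, C)
        alphas = torch.einsum('btc,btnc->btn', q, feat_maps).sigmoid()
        return alphas.reshape(B, T, H, W)

# Toy usage with random tensors (shapes are illustrative only).
model = QueryVideoMattingSketch()
feats = torch.randn(1, 4, 256, 16, 16)             # a 4-frame feature window
queries = torch.randn(1, 4, 256)                   # per-frame learnable queries
alpha_mattes = model(feats, queries)               # (1, 4, 16, 16)
```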
Related papers
- Video Pre-trained Transformer: A Multimodal Mixture of Pre-trained
Experts [2.457872341625575]
We present Video Pre-trained Transformer.
It uses four SOTA encoder models to convert a video into a sequence of compact embeddings.
It learns using an autoregressive causal language modeling loss by predicting the words spoken in YouTube videos.
arXiv Detail & Related papers (2023-03-24T17:18:40Z) - A unified model for continuous conditional video prediction [14.685237010856953]
Conditional video prediction tasks are normally solved by task-related models.
Almost all conditional video prediction models can only achieve discrete prediction.
In this paper, we propose a unified model that addresses these two issues at the same time.
arXiv Detail & Related papers (2022-10-11T22:26:59Z) - Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z) - Masked Conditional Video Diffusion for Prediction, Generation, and
Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with computation times measured in 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - On Pursuit of Designing Multi-modal Transformer for Video Grounding [35.25323276744999]
Video grounding aims to localize the temporal segment corresponding to a sentence query from an untrimmed video.
We propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction.
All three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.
arXiv Detail & Related papers (2021-09-13T16:01:19Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate road layout understanding as a top-view road attribute prediction problem, and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit the following three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.