VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- URL: http://arxiv.org/abs/2305.13311v2
- Date: Wed, 11 Oct 2023 06:28:41 GMT
- Title: VDT: General-purpose Video Diffusion Transformers via Mask Modeling
- Authors: Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo,
Mingyu Ding
- Abstract summary: Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
- Score: 62.71878864360634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work introduces Video Diffusion Transformer (VDT), which pioneers the
use of transformers in diffusion-based video generation. It features
transformer blocks with modularized temporal and spatial attention modules to
leverage the rich spatial-temporal representation inherent in transformers. We
also propose a unified spatial-temporal mask modeling mechanism, seamlessly
integrated with the model, to cater to diverse video generation scenarios. VDT
offers several appealing benefits. 1) It excels at capturing temporal
dependencies to produce temporally consistent video frames and even simulate
the physics and dynamics of 3D objects over time. 2) It facilitates flexible
conditioning information, e.g., simple concatenation in the token space,
effectively unifying different token lengths and modalities. 3) Paired with
our proposed spatial-temporal mask modeling mechanism, it becomes a
general-purpose video diffuser for tackling a range of tasks, including
unconditional generation, video prediction, interpolation, animation, and
completion. Extensive experiments on these tasks spanning various
scenarios, including autonomous driving, natural weather, human action, and
physics-based simulation, demonstrate the effectiveness of VDT. Additionally,
we present comprehensive studies on how VDT handles conditioning information
with the mask modeling mechanism, which we believe will benefit future research
and advance the field. Project page: https://VDT-2023.github.io
Related papers
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - Multi-Scale Temporal Difference Transformer for Video-Text Retrieval [10.509598789325782]
We propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT)
MSTDT mainly addresses a shortcoming of the traditional transformer: its limited ability to capture local temporal information.
In general, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer.
arXiv Detail & Related papers (2024-06-23T13:59:31Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - VDTR: Video Deblurring with Transformer [24.20183395758706]
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process.
This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt the Transformer for video deblurring.
arXiv Detail & Related papers (2022-04-17T14:22:14Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z) - A Video Is Worth Three Views: Trigeminal Transformers for Video-based
Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)