RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane
Networks
- URL: http://arxiv.org/abs/2401.06035v1
- Date: Thu, 11 Jan 2024 16:48:44 GMT
- Title: RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane
Networks
- Authors: Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf
- Abstract summary: We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies.
Our approach reduces computational complexity by a factor of $2$ as measured in FLOPs.
Our model is capable of synthesizing high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
- Score: 63.84589410872608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel unconditional video generative model designed to address
long-term spatial and temporal dependencies. To capture these dependencies, our
approach incorporates a hybrid explicit-implicit tri-plane representation
inspired by 3D-aware generative frameworks developed for three-dimensional
object representation and employs a singular latent code to model an entire
video sequence. Individual video frames are then synthesized from an
intermediate tri-plane representation, which itself is derived from the primary
latent code. This novel strategy reduces computational complexity by a factor
of $2$ as measured in FLOPs. Consequently, our approach facilitates the
efficient and temporally coherent generation of videos. Moreover, our joint
frame modeling approach, in contrast to autoregressive methods, mitigates the
generation of visual artifacts. We further enhance the model's capabilities by
integrating an optical flow-based module within our Generative Adversarial
Network (GAN) based generator architecture, thereby compensating for the
constraints imposed by a smaller generator size. As a result, our model is
capable of synthesizing high-fidelity video clips at a resolution of
$256\times256$ pixels, with durations extending to more than $5$ seconds at a
frame rate of 30 fps. The efficacy and versatility of our approach are
empirically validated through qualitative and quantitative assessments across
three different datasets comprising both synthetic and real video clips.
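To make the mechanism concrete, here is a minimal sketch of the tri-plane idea applied to video: a single latent code is mapped to three 2D feature planes, and each frame is decoded by sampling those planes at its pixel and time coordinates. This is an illustrative reconstruction, not the authors' released code; the plane layout over (x, y), (x, t) and (y, t), the class and module names, and all sizes are assumptions, and the flow-based refinement module mentioned above is omitted.

```python
# Minimal sketch of a tri-plane video generator (illustrative, not the paper's code).
# Assumed layout: feature planes over (x, y), (x, t) and (y, t); all sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneVideoGenerator(nn.Module):
    def __init__(self, z_dim=512, plane_channels=32, plane_size=64, num_frames=150):
        super().__init__()
        self.c, self.s, self.t = plane_channels, plane_size, num_frames
        # One latent code models the whole clip: it is expanded into three 2D planes.
        self.to_planes = nn.Linear(z_dim, 3 * plane_channels * plane_size * plane_size)
        # A small MLP decodes a sampled feature vector into an RGB value.
        self.decode = nn.Sequential(
            nn.Linear(plane_channels, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Tanh(),
        )

    def forward(self, z, frame_idx, resolution=256):
        b = z.shape[0]
        planes = self.to_planes(z).view(b, 3, self.c, self.s, self.s)
        xy, xt, yt = planes[:, 0], planes[:, 1], planes[:, 2]

        # Normalised pixel grid for one frame plus its (shared) time coordinate.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, resolution, device=z.device),
            torch.linspace(-1, 1, resolution, device=z.device),
            indexing="ij",
        )
        ts = torch.full_like(xs, 2.0 * frame_idx / (self.t - 1) - 1.0)

        def sample(plane, u, v):
            grid = torch.stack([u, v], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
            return F.grid_sample(plane, grid, align_corners=True)

        # A pixel's feature is the sum of its projections onto the three planes.
        feat = sample(xy, xs, ys) + sample(xt, xs, ts) + sample(yt, ys, ts)
        rgb = self.decode(feat.permute(0, 2, 3, 1))   # (B, H, W, 3)
        return rgb.permute(0, 3, 1, 2)                # (B, 3, H, W)
```

Because every frame is read off the same set of planes derived from one latent code, the clip is generated jointly rather than autoregressively, which is where the temporal coherence and the FLOP savings of this representation come from.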
Related papers
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
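As a rough illustration of the injection mechanism, the sketch below modulates DiT activations with a scale and shift predicted from pooled AR tokens, in the style of adaptive layer norm; the class name, pooling choice, and dimensions are assumptions rather than ARLON's actual implementation.

```python
# Hedged sketch of adaptive-norm-style semantic injection (AdaLN-like).
# Not ARLON's code; names, pooling, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AdaptiveNormInjection(nn.Module):
    def __init__(self, hidden_dim=1152, token_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict per-channel scale and shift from the coarse AR semantic tokens.
        self.to_scale_shift = nn.Linear(token_dim, 2 * hidden_dim)

    def forward(self, x, ar_tokens):
        # x: (B, N, hidden_dim) DiT activations; ar_tokens: (B, M, token_dim).
        cond = ar_tokens.mean(dim=1)                           # pool coarse tokens
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

Conditioning through normalization parameters lets the coarse autoregressive plan steer the DiT without changing its token layout.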
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images.
We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object.
In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
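The ability to process an arbitrary number of inputs follows naturally from treating every view as a set of tokens; the hedged sketch below shows that general pattern with assumed names, patch size, and output head, and is not the FlexRM implementation.

```python
# Hedged sketch: a reconstruction transformer over a variable number of views.
# Not FlexRM itself; patch size, dimensions, and the output head are assumptions.
import torch
import torch.nn as nn

class VariableViewTransformer(nn.Module):
    def __init__(self, dim=512, patch=16, depth=4, heads=8):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3)   # e.g. a per-token feature or colour

    def forward(self, views):
        # views: (B, V, 3, H, W) with V varying from call to call.
        b, v = views.shape[:2]
        tokens = self.patchify(views.flatten(0, 1))        # (B*V, dim, h, w)
        tokens = tokens.flatten(2).transpose(1, 2)          # (B*V, h*w, dim)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])    # concatenate all views
        return self.head(self.encoder(tokens))              # (B, V*h*w, 3)
```

Since self-attention is agnostic to sequence length, the same weights handle one view or dozens, which is what allows the curated candidate pool to vary in size.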
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
- VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model [34.35449902855767]
Two fundamental questions are what data to use for training and how to ensure multi-view consistency.
We propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models.
Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-18T17:48:15Z)
- Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [112.66832145320434]
Video-FocalNet is an effective and efficient architecture for video recognition that models both local and global contexts.
Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention.
We show that Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets.
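A hedged, single-level 2D sketch of focal modulation is given below to illustrate the reversed ordering (aggregate context first, then modulate the query elementwise); the real Video-FocalNets design is hierarchical and spatio-temporal, and the names and kernel size here are assumptions.

```python
# Hedged sketch of single-level focal modulation, not the Video-FocalNets code.
# The actual model uses hierarchical spatio-temporal aggregation; this is 2D, one level.
import torch
import torch.nn as nn

class FocalModulation2D(nn.Module):
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)       # query projection
        self.ctx = nn.Conv2d(dim, dim, 1)     # context projection
        self.gate = nn.Conv2d(dim, dim, 1)
        # Aggregation first: a cheap depthwise conv gathers local context.
        self.aggregate = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        # x: (B, C, H, W). Aggregate context, then modulate the query with it
        # (the reverse of self-attention's interact-then-aggregate ordering).
        context = self.aggregate(self.ctx(x))
        modulator = torch.sigmoid(self.gate(x)) * context
        return self.proj(self.q(x) * modulator)
```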
arXiv Detail & Related papers (2023-07-13T17:59:33Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
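The efficiency argument is the usual latent-diffusion one: compress the video into a compact latent and learn to denoise there instead of in pixel space. Below is a schematic training step under that assumption; `encoder`, `denoiser`, and the noise schedule are illustrative stand-ins, not PVDM's actual modules.

```python
# Hedged sketch of one denoising-training step in a compact latent space.
# `encoder` and `denoiser` are assumed stand-ins, not PVDM's actual modules.
import torch
import torch.nn.functional as F

def latent_diffusion_step(encoder, denoiser, video, num_steps=1000):
    # video: (B, T, 3, H, W) -> compact latent (shape depends on the encoder).
    with torch.no_grad():
        z0 = encoder(video)

    t = torch.randint(0, num_steps, (z0.shape[0],), device=z0.device)
    alpha = torch.cos(0.5 * torch.pi * t.float() / num_steps)   # toy schedule
    while alpha.dim() < z0.dim():
        alpha = alpha.unsqueeze(-1)

    noise = torch.randn_like(z0)
    zt = alpha * z0 + (1 - alpha ** 2).sqrt() * noise           # noised latent
    return F.mse_loss(denoiser(zt, t), noise)                   # predict the noise
```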
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
- Streaming Radiance Fields for 3D Video Synthesis [32.856346090347174]
We present an explicit-grid based method for reconstructing streaming radiance fields for novel view synthesis of real world dynamic scenes.
Experiments on challenging video sequences demonstrate that our approach is capable of achieving a training speed of 15 seconds per frame with competitive rendering quality.
arXiv Detail & Related papers (2022-10-26T16:23:02Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)