StyleGAN-V: A Continuous Video Generator with the Price, Image Quality
and Perks of StyleGAN2
- URL: http://arxiv.org/abs/2112.14683v1
- Date: Wed, 29 Dec 2021 17:58:29 GMT
- Title: StyleGAN-V: A Continuous Video Generator with the Price, Image Quality
and Perks of StyleGAN2
- Authors: Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny
- Abstract summary: We think of videos as what they should be - time-continuous signals - and extend the paradigm of neural representations to build a continuous-time video generator.
We build our model on top of StyleGAN2 and it is just 5% more expensive to train at the same resolution while achieving almost the same image quality.
Our model achieves state-of-the-art results on four modern 256$^2$ video synthesis benchmarks and one 1024$^2$ resolution one.
- Score: 39.835681276854025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Videos show continuous events, yet most - if not all - video synthesis
frameworks treat them discretely in time. In this work, we think of videos as
what they should be - time-continuous signals - and extend the paradigm of
neural representations to build a continuous-time video generator. For this, we
first design continuous motion representations through the lens of positional
embeddings. Then, we explore the question of training on very sparse videos and
demonstrate that a good generator can be learned by using as few as 2 frames
per clip. After that, we rethink the traditional image and video discriminators
pair and propose to use a single hypernetwork-based one. This decreases the
training cost and provides richer learning signal to the generator, making it
possible to train directly on 1024$^2$ videos for the first time. We build our
model on top of StyleGAN2 and it is just 5% more expensive to train at the same
resolution while achieving almost the same image quality. Moreover, our latent
space features similar properties, enabling spatial manipulations that our
method can propagate in time. We can generate arbitrarily long videos at
arbitrarily high frame rate, while prior work struggles to generate even 64
frames at a fixed rate. Our model achieves state-of-the-art results on four
modern 256$^2$ video synthesis benchmarks and one 1024$^2$ resolution one.
Videos and the source code are available at the project website:
https://universome.github.io/stylegan-v.
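
As a rough illustration of what "continuous motion representations through the lens of positional embeddings" can mean in practice, the sketch below evaluates a standard sinusoidal positional embedding at a real-valued timestamp, so frames can be queried at any rate rather than on a fixed grid. It is a minimal, hypothetical example: the function name, dimensions, and frequency schedule are assumptions for illustration, not StyleGAN-V's actual (learned) motion embeddings.

```python
# Hypothetical sketch, NOT the paper's implementation: StyleGAN-V learns its
# motion embeddings. This only illustrates the basic idea that a frame at any
# real-valued timestamp t maps to a fixed-size code, so the generator is not
# tied to a discrete frame grid.
import numpy as np

def time_embedding(t: float, dim: int = 16, max_period: float = 1024.0) -> np.ndarray:
    """Map a continuous timestamp t (frames or seconds) to a `dim`-sized
    sinusoidal code, with geometrically spaced frequencies as in standard
    Transformer positional encodings."""
    assert dim % 2 == 0
    freqs = np.exp(-np.log(max_period) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Because t is continuous, frames can be queried at an arbitrary rate,
# e.g. rendering at 120 fps even if training clips were sampled sparsely.
codes = np.stack([time_embedding(t) for t in np.arange(0.0, 2.0, 1.0 / 120.0)])
print(codes.shape)  # (240, 16)
```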
Related papers
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost.
In this paper, we argue that videos contain much more redundant information than images, thus can be encoded by very few motion latents.
We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024$\times$1024 video clip within 15.5 seconds on a single A100 GPU.
arXiv Detail & Related papers (2024-11-20T18:59:52Z) - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Replacing it with scaled spatiotemporal transformers allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z) - Is a Video worth $n\times n$ Images? A Highly Efficient Approach to
Transformer-based Video Question Answering [14.659023742381777]
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders followed by interaction between frames and question.
We present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we arrange video frames into an $n\times n$ matrix and then convert it to one image.
arXiv Detail & Related papers (2023-05-16T02:12:57Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinitely many, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Talking Head from Speech Audio using a Pre-trained Image Generator [5.659018934205065]
We propose a novel method for generating high-resolution videos of talking-heads from speech audio and a single 'identity' image.
We model each frame as a point in the latent space of StyleGAN so that a video corresponds to a trajectory through the latent space.
We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator.
arXiv Detail & Related papers (2022-09-09T11:20:37Z) - Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z) - Diverse Generation from a Single Video Made Possible [24.39972895902724]
We present a fast and practical method for video generation and manipulation from a single natural video.
Our method generates more realistic and higher quality results than single-video GANs.
arXiv Detail & Related papers (2021-09-17T15:12:17Z) - A Good Image Generator Is What You Need for High-Resolution Video
Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
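
Two of the entries above ("Talking Head from Speech Audio using a Pre-trained Image Generator" and "A Good Image Generator Is What You Need for High-Resolution Video Synthesis") describe the same general recipe: keep a pre-trained image generator fixed and synthesize a video as a trajectory through its latent space, with a separate motion module emitting small per-frame displacements. The sketch below is a hypothetical, heavily simplified illustration of that recipe; the image generator and the motion step are stand-in functions with made-up shapes and step sizes, not the models from these papers.

```python
# Hypothetical, heavily simplified sketch of the "trajectory in latent space"
# recipe: a fixed image generator renders frames while a motion module walks
# the latent code. Both modules are stand-ins, not the authors' models.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512

def frozen_image_generator(z: np.ndarray) -> np.ndarray:
    """Stand-in for a fixed, pre-trained generator (e.g. StyleGAN2) that maps
    a latent code to an image; here it just returns noise of image shape."""
    return rng.standard_normal((256, 256, 3))

def motion_step(z: np.ndarray, h: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stand-in for a learned motion generator (an RNN in the cited works):
    given the current latent and a hidden state, predict a small displacement."""
    h = np.tanh(0.9 * h + 0.1 * rng.standard_normal(LATENT_DIM))
    dz = 0.05 * h                      # small step -> smooth frame-to-frame motion
    return z + dz, h

# Content (appearance) is sampled once; motion unfolds over time.
z = rng.standard_normal(LATENT_DIM)
h = np.zeros(LATENT_DIM)
video = []
for _ in range(16):                    # 16 frames here, but the length is arbitrary
    video.append(frozen_image_generator(z))
    z, h = motion_step(z, h)
print(len(video), video[0].shape)      # 16 (256, 256, 3)
```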