StyleInV: A Temporal Style Modulated Inversion Network for Unconditional
Video Generation
- URL: http://arxiv.org/abs/2308.16909v1
- Date: Thu, 31 Aug 2023 17:59:33 GMT
- Title: StyleInV: A Temporal Style Modulated Inversion Network for Unconditional
Video Generation
- Authors: Yuhan Wang, Liming Jiang, Chen Change Loy
- Abstract summary: We introduce a novel motion generator design that uses a learning-based GAN inversion network.
Our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator.
- Score: 73.54398908446906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unconditional video generation is a challenging task that involves
synthesizing high-quality videos that are both coherent and of extended
duration. To address this challenge, researchers have used pretrained StyleGAN
image generators for high-quality frame synthesis and focused on motion
generator design. The motion generator is trained in an autoregressive manner
using heavy 3D convolutional discriminators to ensure motion coherence during
video generation. In this paper, we introduce a novel motion generator design
that uses a learning-based GAN inversion network. The encoder in our method
captures rich and smooth priors from encoding images to latents, and, given the
latent of an initially generated frame as guidance, our method can generate
smooth future latents by modulating the inversion encoder temporally. Our method
enjoys the advantage of sparse training and naturally constrains the generation
space of our motion generator with the inversion network guided by the initial
frame, eliminating the need for heavy discriminators. Moreover, our method
supports style transfer with simple fine-tuning when the encoder is paired with
a pretrained StyleGAN generator. Extensive experiments conducted on various
benchmarks demonstrate the superiority of our method in generating long and
high-resolution videos with decent single-frame quality and temporal
consistency.
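As a concrete illustration of the design, the following is a minimal PyTorch sketch of the temporal style modulation idea: a toy inversion encoder whose features are scaled and shifted (FiLM-style) by a per-timestep motion style, producing future latents anchored to the initial frame's latent. Module names, dimensions, and the residual/modulation form here are illustrative assumptions, not the paper's exact architecture; a frozen pretrained StyleGAN generator would decode each latent into a frame.

```python
import torch
import torch.nn as nn

class ModulatedInversionEncoder(nn.Module):
    """Toy stand-in for an inversion encoder whose features are modulated
    (FiLM-style scale and shift) by a temporal motion style code."""
    def __init__(self, latent_dim=512, style_dim=512, hidden=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(latent_dim, hidden), nn.LeakyReLU(0.2))
        self.to_scale = nn.Linear(style_dim, hidden)  # per-channel scale from style
        self.to_shift = nn.Linear(style_dim, hidden)  # per-channel shift from style
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, w0, style):
        h = self.backbone(w0)                         # features from the initial latent
        h = self.to_scale(style) * h + self.to_shift(style)
        return w0 + self.head(h)                      # residual keeps w_t near w0

def rollout(encoder, motion_mapper, w0, num_frames, noise):
    """Produce a latent trajectory guided by the initial frame's latent w0;
    a frozen StyleGAN generator would decode each latent into a frame."""
    latents = []
    for t in range(num_frames):
        t_embed = torch.full((noise.shape[0], 1), float(t))
        style = motion_mapper(torch.cat([noise, t_embed], dim=1))
        latents.append(encoder(w0, style))
    return torch.stack(latents, dim=1)                # (batch, frames, latent_dim)

enc = ModulatedInversionEncoder()
mapper = nn.Sequential(nn.Linear(65, 512), nn.LeakyReLU(0.2), nn.Linear(512, 512))
w0 = torch.randn(2, 512)              # latent of the initially generated frame
z = torch.randn(2, 64)                # per-video motion noise
traj = rollout(enc, mapper, w0, 8, z) # -> (2, 8, 512)
```

Because every future latent is predicted from w0 through the encoder, the trajectory stays inside the generator's well-behaved latent space, which is the property the abstract credits for eliminating heavy video discriminators.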
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane
Networks [63.84589410872608]
We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies.
Our approach reduces computational complexity by a factor of $2$ as measured in FLOPs.
Our model is capable of synthesizing high-fidelity video clips at a resolution of $256 \times 256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion
Models [95.47438940934413]
We conduct the first comprehensive study of the UNet encoder.
We find that encoder features change gently, whereas the decoder features exhibit substantial variations across different time-steps.
Benefiting from our propagation scheme, we are able to run the decoder in parallel at certain adjacent time-steps (a toy version of this reuse is sketched after the entry).
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
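The Faster Diffusion entry above rests on the observation that encoder features drift only gently across time-steps. The heavily simplified sketch below, with hypothetical `encoder`/`decoder` callables standing in for the two halves of a diffusion UNet, caches skip features once per group of steps so the remaining decoder passes become independent (and hence parallelizable); the paper's actual propagation scheme is more involved.

```python
def denoise_with_encoder_reuse(encoder, decoder, x_t, timesteps, group=2):
    """Hypothetical split UNet: encoder(x, t) -> skip features,
    decoder(skips, t) -> noise prediction. Encoder runs once per group."""
    preds = []
    for i in range(0, len(timesteps), group):
        chunk = timesteps[i:i + group]
        skips = encoder(x_t, chunk[0])  # encode once at the key time-step
        # The remaining decoder calls share the cached skips, so they are
        # independent and could run in parallel; shown sequentially here.
        preds.extend(decoder(skips, t) for t in chunk)
    return preds
```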
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
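Latent-Shift's title names its core operation. A generic temporal-shift sketch in the spirit of that idea (the channel split fraction and tensor layout here are assumptions) moves one slice of channels a step forward and another a step backward along the frame axis, exchanging information between frames with zero extra parameters:

```python
import torch

def temporal_shift(x, shift_frac=4):
    """x: (batch, frames, channels, height, width) latent feature map."""
    b, t, c, h, w = x.shape
    fold = c // shift_frac
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # rest left untouched
    return out

frames = torch.randn(2, 8, 64, 32, 32)  # toy latent video tensor
shifted = temporal_shift(frames)        # same shape, temporally mixed
```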
- MotionVideoGAN: A Novel Video Generator Based on the Motion Space Learned from Image Pairs [16.964371778504297]
We present MotionVideoGAN, a novel video generator synthesizing videos based on the motion space learned by pre-trained image pair generators.
Motion codes let us edit images within the motion space, since the edited image shares the same content as the other, unchanged image in the pair.
Our approach achieves state-of-the-art performance on the most complex video dataset ever used for unconditional video generation evaluation, UCF101.
arXiv Detail & Related papers (2023-03-06T05:52:13Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Generating Videos with Dynamics-aware Implicit Generative Adversarial
Networks [68.93429034530077]
We propose dynamics-aware implicit generative adversarial network (DIGAN) for video generation.
We show that DIGAN can be trained on 128-frame videos at 128×128 resolution, 80 frames longer than the 48 frames handled by the previous state-of-the-art method.
arXiv Detail & Related papers (2022-02-21T23:24:01Z)
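DIGAN's generator is implicit, i.e., coordinate-based. As a rough illustration of that family (layer sizes and conditioning below are assumptions, not DIGAN's actual design), a video is a function from (x, y, t) coordinates and a content code to RGB, so longer clips simply mean querying more time values:

```python
import torch
import torch.nn as nn

class ImplicitVideoGenerator(nn.Module):
    """Toy coordinate MLP: (x, y, t) plus a content latent -> RGB."""
    def __init__(self, latent_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords, z):
        zs = z.expand(coords.shape[0], -1)  # broadcast the content code
        return self.mlp(torch.cat([coords, zs], dim=1))

g = ImplicitVideoGenerator()
xs, ys, ts = torch.meshgrid(
    torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32),
    torch.linspace(0, 1, 8), indexing="ij")
coords = torch.stack([xs, ys, ts], dim=-1).reshape(-1, 3)  # (32*32*8, 3)
video = g(coords, torch.randn(64)).reshape(32, 32, 8, 3)   # toy 8-frame clip
```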
- Feature-Style Encoder for Style-Based GAN Inversion [1.9116784879310027]
We propose a novel architecture for GAN inversion, which we call Feature-Style encoder.
Our model achieves accurate inversion of real images from the latent space of a pre-trained style-based GAN model.
Thanks to its encoder structure, the model allows fast and accurate image editing.
arXiv Detail & Related papers (2022-02-04T15:19:34Z)
- Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder where the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z)
- AE-StyleGAN: Improved Training of Style-Based Auto-Encoders [21.51697087024866]
StyleGANs have shown impressive results on data generation and manipulation in recent years.
In this paper, we focus on style-based generators asking a scientific question: Does forcing such a generator to reconstruct real data lead to more disentangled latent space and make the inversion process from image to latent space easy?
We describe a new methodology to train a style-based autoencoder where the encoder and generator are optimized end-to-end.
arXiv Detail & Related papers (2021-10-17T04:25:51Z)
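The AE-StyleGAN entry's key move is optimizing the encoder and generator end-to-end. One plausible shape of such a joint update, with a hypothetical mix of reconstruction and adversarial terms (the paper's exact objectives and schedule may differ), is:

```python
import torch.nn.functional as F

def joint_autoencoder_step(E, G, D, opt_eg, real):
    """One E+G update: G must reconstruct real images inverted by E, while a
    discriminator D (held fixed in this step) pushes reconstructions to look real."""
    w = E(real)                                       # images -> latent codes
    recon = G(w)                                      # latent codes -> images
    loss = F.mse_loss(recon, real) - D(recon).mean()  # hypothetical loss mix
    opt_eg.zero_grad()
    loss.backward()
    opt_eg.step()
    return loss.item()
```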