StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN
- URL: http://arxiv.org/abs/2107.07224v1
- Date: Thu, 15 Jul 2021 09:58:15 GMT
- Title: StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN
- Authors: Gereon Fox and Ayush Tewari and Mohamed Elgharib and Christian
Theobalt
- Abstract summary: We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
- Score: 70.31913835035206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative adversarial models (GANs) continue to produce advances in terms of
the visual quality of still images, as well as the learning of temporal
correlations. However, few works manage to combine these two interesting
capabilities for the synthesis of video content: Most methods require an
extensive training dataset in order to learn temporal correlations, while being
rather limited in the resolution and visual quality of their output frames. In
this paper, we present a novel approach to the video synthesis problem that
helps to greatly improve visual quality and drastically reduce the amount of
training data and resources necessary for generating video content. Our
formulation separates the spatial domain, in which individual frames are
synthesized, from the temporal domain, in which motion is generated. For the
spatial domain we make use of a pre-trained StyleGAN network, the latent space
of which allows control over the appearance of the objects it was trained for.
The expressive power of this model allows us to embed our training videos in
the StyleGAN latent space. Our temporal architecture is then trained not on
sequences of RGB frames, but on sequences of StyleGAN latent codes. The
advantageous properties of the StyleGAN space simplify the discovery of
temporal correlations. We demonstrate that it suffices to train our temporal
architecture on only 10 minutes of footage of one subject for about 6 hours.
After training, our model can not only generate new portrait videos for the
training subject, but also for any random subject which can be embedded in the
StyleGAN space.
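The spatial/temporal separation described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal illustration of the idea only, not the authors' implementation: the optimization-based `embed_video` routine, the GRU-based `TemporalLatentModel`, the 512-dimensional latent size, and all hyperparameters are assumptions, and the pretrained StyleGAN generator `G` is treated as a frozen black box that maps a latent code to an RGB frame.

```python
# Minimal sketch of the two-stage idea in the abstract. Assumptions: `G` is a
# frozen, pretrained StyleGAN generator mapping a (1, LATENT_DIM) latent code
# to a (1, 3, H, W) image; the embedding loop, the GRU model, and all
# hyperparameters are illustrative, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 512  # assumed dimensionality of the StyleGAN latent space


def embed_video(frames, G, steps=500, lr=0.01):
    """Project each RGB frame of a (T, 3, H, W) video into the latent space.

    A generic optimization-based GAN inversion, standing in for whatever
    embedding procedure the paper uses. Returns a (T, LATENT_DIM) sequence.
    """
    codes = []
    for frame in frames:
        w = torch.zeros(1, LATENT_DIM, requires_grad=True)
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.mse_loss(G(w), frame.unsqueeze(0))
            loss.backward()
            opt.step()
        codes.append(w.detach().squeeze(0))
    return torch.stack(codes)  # (T, LATENT_DIM)


class TemporalLatentModel(nn.Module):
    """Autoregressive model over latent-code sequences; the spatial domain is
    handled entirely by the frozen StyleGAN, which never sees gradients here."""

    def __init__(self, latent_dim=LATENT_DIM, hidden=1024):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, w_seq):      # w_seq: (B, T, LATENT_DIM)
        h, _ = self.rnn(w_seq)
        return self.head(h)        # predicted latent code for the next frame
```

Training then operates purely on latent sequences rather than RGB frames; at inference time, a generated latent sequence is pushed frame by frame through the frozen StyleGAN generator to obtain the output video.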
Related papers
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce a downstream task of Sketch Guided Videopainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z) - Learning Fine-Grained Visual Understanding for Video Question Answering
via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs for randomly shuffled frames, as sketched below.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
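To make the last entry above concrete, here is a minimal sketch of a shuffled-frame low-confidence objective. It is an illustration under stated assumptions rather than that paper's exact loss: the model interface, the KL-divergence-to-uniform formulation, and the loss weighting are assumptions.

```python
# Hedged sketch of a temporal-order self-supervision term: temporally shuffled
# clips should yield low-confidence (near-uniform) class predictions.
import torch
import torch.nn.functional as F


def shuffled_frame_loss(model, clip):
    """clip: (B, T, C, H, W) video batch; `model` returns class logits."""
    T = clip.shape[1]
    shuffled = clip[:, torch.randperm(T)]          # destroy the temporal order
    logits = model(shuffled)                       # (B, num_classes)
    uniform = torch.full_like(logits, 1.0 / logits.size(-1))
    # KL divergence to the uniform distribution penalizes confident predictions
    return F.kl_div(logits.log_softmax(-1), uniform, reduction="batchmean")


# Assumed usage alongside the ordinary classification loss on ordered clips:
# total_loss = ce_loss + lambda_shuffle * shuffled_frame_loss(model, clip)
```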