Encode-in-Style: Latent-based Video Encoding using StyleGAN2
- URL: http://arxiv.org/abs/2203.14512v1
- Date: Mon, 28 Mar 2022 05:44:19 GMT
- Title: Encode-in-Style: Latent-based Video Encoding using StyleGAN2
- Authors: Trevine Oorloff, Yaser Yacoob
- Abstract summary: We propose an end-to-end facial video encoding approach that facilitates data-efficient high-quality video re-synthesis.
The approach builds on StyleGAN2 image inversion and multi-stage non-linear latent-space editing to generate videos that are nearly comparable to input videos.
- Score: 0.7614628596146599
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an end-to-end facial video encoding approach that facilitates
data-efficient high-quality video re-synthesis by optimizing low-dimensional
edits of a single Identity-latent. The approach builds on StyleGAN2 image
inversion and multi-stage non-linear latent-space editing to generate videos
that are nearly comparable to input videos. It economically captures face
identity, head-pose, and complex facial motions at fine levels, and thereby
bypasses training and person modeling which tend to hamper many re-synthesis
approaches. The approach is designed with maximum data efficiency, where a
single W+ latent and 35 parameters per frame enable high-fidelity video
rendering. This pipeline can also be used for puppeteering (i.e., motion
transfer).
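The abstract's data-efficiency claim can be made concrete with a back-of-the-envelope calculation. The sketch below is ours, not the authors' code; the 18 x 512 shape of the W+ latent is an assumption (standard for StyleGAN2 at 1024x1024 resolution), while the 35 parameters per frame come directly from the abstract.

```python
# Minimal sketch (ours, not the authors' implementation) of the storage budget
# implied by the abstract: one W+ identity latent per video plus 35 low-dimensional
# edit parameters per frame. The 18 x 512 W+ shape is an assumption
# (StyleGAN2 at 1024x1024 resolution).

W_PLUS_DIMS = 18 * 512      # single identity latent shared across the whole video
PARAMS_PER_FRAME = 35       # per-frame latent edits, as stated in the abstract

def encoding_size(num_frames: int) -> int:
    """Total scalar parameters needed to re-synthesize a clip of num_frames."""
    return W_PLUS_DIMS + PARAMS_PER_FRAME * num_frames

def raw_size(num_frames: int, h: int = 1024, w: int = 1024, c: int = 3) -> int:
    """Raw pixel count of the same clip, for a rough comparison."""
    return num_frames * h * w * c

frames = 300  # a 10-second clip at 30 fps
print(encoding_size(frames))                      # 18*512 + 35*300 scalars
print(raw_size(frames) / encoding_size(frames))   # rough compression factor vs raw pixels
```

Under these assumptions, a 10-second clip needs under 20k scalars, several orders of magnitude fewer than its raw pixels, which is what makes the representation cheap enough to optimize per video without person-specific training.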
Related papers
- I4VGen: Image as Stepping Stone for Text-to-Video Generation [22.3850273729521]
I4VGen is a training-free and plug-and-play video diffusion inference framework.
It enhances text-to-video generation by leveraging robust image techniques.
I4VGen produces videos with higher visual realism and textual fidelity.
arXiv Detail & Related papers (2024-06-04T11:48:44Z)
- Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation [35.52770785430601]
We propose a novel hybrid video diffusion model (HVDM) that captures intricate spatio-temporal dependencies more effectively.
The HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video.
This hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details.
arXiv Detail & Related papers (2024-02-21T11:46:16Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [63.84589410872608]
We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies.
Our approach reduces computational complexity by a factor of $2$ as measured in FLOPs.
Our model is capable of synthesizing high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages.
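The core operation GenDeF relies on, warping a static image with a per-pixel deformation field, can be sketched in a few lines. This is an illustrative toy (nearest-neighbor sampling, NumPy), not the paper's implementation, and the flow convention chosen here is an assumption.

```python
import numpy as np

def warp_image(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp an image with a per-pixel deformation field.

    Toy sketch of deformation-field warping, not GenDeF's actual code.
    flow has shape (H, W, 2): output pixel (y, x) samples the source at
    (y + flow[y, x, 0], x + flow[y, x, 1]), nearest-neighbor, edge-clamped.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_y, src_x]

img = np.arange(12, dtype=float).reshape(3, 4)

# A zero deformation field reproduces the input frame exactly.
identity = np.zeros((3, 4, 2))
print(np.array_equal(warp_image(img, identity), img))  # True
```

In a generative pipeline, a network would predict one such flow field per frame, so temporal dynamics live entirely in the (cheap) deformation fields while appearance stays in the single source image.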
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pretrained StyleGAN [49.917296433657484]
One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image.
In this work, we investigate the latent feature space of a pre-trained StyleGAN and discover some excellent spatial transformation properties.
We propose a novel unified framework based on a pre-trained StyleGAN that enables a set of powerful functionalities.
arXiv Detail & Related papers (2022-03-08T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.