VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
- URL: http://arxiv.org/abs/2403.08764v1
- Date: Wed, 13 Mar 2024 17:59:02 GMT
- Title: VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
- Authors: Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros,
Thiemo Alldieck, Cristian Sminchisescu
- Abstract summary: VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
- Score: 40.869862603815875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose VLOGGER, a method for audio-driven human video generation from a
single input image of a person, which builds on the success of recent
generative diffusion models. Our method consists of 1) a stochastic
human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture
that augments text-to-image models with both spatial and temporal controls.
This supports the generation of high quality video of variable length, easily
controllable through high-level representations of human faces and bodies. In
contrast to previous work, our method does not require training for each
person, does not rely on face detection and cropping, generates the complete
image (not just the face or the lips), and considers a broad spectrum of
scenarios (e.g. visible torso or diverse subject identities) that are critical
to correctly synthesize humans who communicate. We also curate MENTOR, a new
and diverse dataset with 3d pose and expression annotations, one order of
magnitude larger than previous ones (800,000 identities) and with dynamic
gestures, on which we train and ablate our main technical contributions.
VLOGGER outperforms state-of-the-art methods in three public benchmarks,
considering image quality, identity preservation and temporal consistency while
also generating upper-body gestures. We analyze the performance of VLOGGER with
respect to multiple diversity metrics, showing that our architectural choices
and the use of MENTOR benefit training a fair and unbiased model at scale.
Finally we show applications in video editing and personalization.
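The abstract describes a two-stage design: a stochastic diffusion model that maps input audio to per-frame 3D body and face parameters, followed by a temporally controlled image diffusion model that renders full frames from a single reference image. The PyTorch sketch below illustrates only that control flow under stated assumptions; every module name, feature dimension, and the simplified denoising update is an illustrative placeholder, not VLOGGER's implementation.

```python
# Hedged sketch of the two-stage pipeline the abstract describes.
# All modules, shapes, and the toy denoising update are assumptions.
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    """Stage 1 (sketch): denoise a sequence of 3D pose/expression params."""
    def __init__(self, audio_dim=128, motion_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, motion_dim))

    @torch.no_grad()
    def sample(self, audio, steps=50):                 # audio: (T, audio_dim)
        x = torch.randn(audio.shape[0], 256)           # noisy motion sequence
        for t in reversed(range(steps)):
            t_col = torch.full((audio.shape[0], 1), t / steps)
            eps = self.net(torch.cat([x, audio, t_col], dim=-1))
            x = x - eps / steps                        # toy stand-in for a DDPM step
        return x                                       # (T, motion_dim) 3D motion

class MotionToVideo(nn.Module):
    """Stage 2 (sketch): render frames conditioned on the reference image
    (identity / spatial control) and the motion sequence (temporal control)."""
    def __init__(self, motion_dim=256, img_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + motion_dim + img_dim, 1024), nn.SiLU(),
            nn.Linear(1024, img_dim))

    @torch.no_grad()
    def sample(self, ref_img_feat, motion, steps=50):  # motion: (T, motion_dim)
        frames = torch.randn(motion.shape[0], 512)     # one latent per frame
        ref = ref_img_feat.expand(motion.shape[0], -1)
        for t in reversed(range(steps)):
            eps = self.net(torch.cat([frames, motion, ref], dim=-1))
            frames = frames - eps / steps
        return frames                                  # (T, img_dim) frame latents

audio = torch.randn(16, 128)                # 16 frames of audio features
ref = torch.randn(1, 512)                   # features of the single input image
motion = AudioToMotion().sample(audio)      # stochastic 3D motion from audio
video = MotionToVideo().sample(ref, motion) # variable-length video
print(video.shape)                          # torch.Size([16, 512])
```

The point of the sketch is the intermediate representation: predicted 3D motion, not raw pixels, carries the audio signal into the video model, which is what lets the method stay person-agnostic and generate complete frames rather than cropped faces.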
Related papers
- Human Video Generation from a Single Image with 3D Pose and View Control [62.676151243249556]
We present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, temporally coherent human videos from a single image.
HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii)
arXiv Detail & Related papers (2026-02-24T18:42:20Z)
- UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework.
We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
- Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization [38.70220886362519]
We propose Identity-Preserving Reward-guided Optimization (IPRO) for image-to-video (I2V) generation.
IPRO is a novel video diffusion framework based on reinforcement learning to enhance identity preservation.
Our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer feedback.
arXiv Detail & Related papers (2025-10-16T03:13:47Z)
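The IPRO entry above hinges on one mechanism: keeping only the last few denoising steps differentiable, so a reward (e.g., an identity-similarity score) can be backpropagated into the sampler without storing the whole chain. A minimal, hedged PyTorch sketch of that idea follows; the linear denoiser, toy step rule, and reward model are stand-ins, not the paper's components.

```python
# Hedged sketch: backpropagate a reward through only the last K sampling steps.
import torch
import torch.nn as nn

denoiser = nn.Linear(64, 64)        # stand-in for a video diffusion denoiser
reward_model = nn.Linear(64, 1)     # stand-in for an identity-similarity scorer
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)  # only the denoiser is tuned

def reward_guided_step(steps=50, k_grad=5):
    x = torch.randn(1, 64)          # initial noise (notionally, a video latent)
    with torch.no_grad():           # most of the chain runs gradient-free...
        for _ in range(steps - k_grad):
            x = x - 0.02 * denoiser(x)
    for _ in range(k_grad):         # ...only the last k_grad steps build a graph,
        x = x - 0.02 * denoiser(x)  # so the reward gradient reaches the denoiser
    reward = reward_model(x).mean() # richer feedback than a scalar policy reward
    opt.zero_grad()
    (-reward).backward()            # ascend the reward
    opt.step()
    return float(reward)

print(reward_guided_step())
```

Truncating the graph to the final steps trades some gradient fidelity for memory, which is what makes this kind of reward guidance practical on long sampling chains.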
- Interspatial Attention for Efficient 4D Human Video Generation [98.36274427702915]
We introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern video generation models.
ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos.
Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation.
arXiv Detail & Related papers (2025-05-21T17:53:47Z)
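The ISA entry above describes cross attention with relative positional encodings. As one way to picture that ingredient, here is a hedged sketch of cross attention with a learned relative-position bias added to the attention logits (in the style of T5's relative bias); the actual interspatial attention layout of the paper is not reproduced here.

```python
# Hedged sketch: cross attention plus a learned relative positional bias.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelBiasCrossAttention(nn.Module):
    def __init__(self, dim=64, max_rel=32):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        # One learned scalar bias per clamped relative offset between positions.
        self.rel_bias = nn.Embedding(2 * max_rel + 1, 1)
        self.max_rel = max_rel

    def forward(self, x, ctx):
        # x: (B, N, dim) queries; ctx: (B, M, dim) keys/values.
        q, k, v = self.q(x), self.k(ctx), self.v(ctx)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, N, M)
        pos_q = torch.arange(x.shape[1]).unsqueeze(1)           # (N, 1)
        pos_k = torch.arange(ctx.shape[1]).unsqueeze(0)         # (1, M)
        rel = (pos_k - pos_q).clamp(-self.max_rel, self.max_rel) + self.max_rel
        logits = logits + self.rel_bias(rel).squeeze(-1)        # add relative bias
        return F.softmax(logits, dim=-1) @ v

attn = RelBiasCrossAttention()
out = attn(torch.randn(2, 16, 64), torch.randn(2, 24, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```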
- SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation [0.0]
Generating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision.
We present SVAD, a novel approach that addresses this challenge by leveraging the complementary strengths of existing techniques.
Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars.
arXiv Detail & Related papers (2025-05-08T17:59:58Z)
- Multi-identity Human Image Animation with Structural Video Diffusion [64.20452431561436]
We present Structural Video Diffusion, a novel framework for generating realistic multi-human videos.
Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals.
We expand an existing human video dataset with 25K new videos featuring diverse multi-human and object-interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z)
- HumanGif: Single-View Human Diffusion with Generative Prior [25.516544735593087]
We propose HumanGif, a single-view human diffusion model with generative priors.
Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process.
We show that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
arXiv Detail & Related papers (2025-02-17T17:55:27Z)
- Joint Learning of Depth and Appearance for Portrait Image Animation [20.83495988491606]
We propose to jointly learn visual appearance and depth in a diffusion-based portrait image generator.
Our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation.
arXiv Detail & Related papers (2025-01-15T08:24:35Z)
- Single Image, Any Face: Generalisable 3D Face Generation [59.9369171926757]
We propose a novel model, Gen3D-Face, which generates 3D human faces from unconstrained single-image input.
To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images.
arXiv Detail & Related papers (2024-09-25T14:56:37Z)
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composed of individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z)
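The MultiPly entry above rests on a standard recipe: give each person and the background its own neural field, then composite them with differentiable volume rendering so every layer receives gradients from pixel losses. The sketch below shows that recipe under stated assumptions; the toy MLP fields and single-ray renderer are illustrative, not the paper's models.

```python
# Hedged sketch: composite several per-entity radiance fields along one ray
# with standard differentiable volume rendering quadrature.
import torch
import torch.nn as nn

class ToyField(nn.Module):
    """Maps a 3D point to (density, rgb) for one layer (a person or background)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, pts):                            # pts: (S, 3)
        out = self.net(pts)
        return torch.relu(out[:, :1]), torch.sigmoid(out[:, 1:])  # sigma, rgb

def render_ray(origin, direction, layers, n_samples=64, near=0.1, far=4.0):
    t = torch.linspace(near, far, n_samples)           # sample depths along the ray
    pts = origin + t[:, None] * direction              # (S, 3) sample points
    sigma = torch.zeros(n_samples, 1)
    rgb = torch.zeros(n_samples, 3)
    for field in layers:                               # sum densities across layers,
        s, c = field(pts)                              # weight colors by density
        sigma = sigma + s
        rgb = rgb + s * c
    rgb = rgb / sigma.clamp(min=1e-6)                  # density-weighted mean color
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha                            # standard quadrature weights
    return (weights[:, None] * rgb).sum(dim=0)         # composited pixel color

layers = [ToyField(), ToyField(), ToyField()]          # e.g. two people + background
pixel = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]), layers)
print(pixel)  # differentiable w.r.t. all layer parameters
```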
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- CapHuman: Capture Your Moments in Parallel Universes [60.06408546134581]
We present a new framework named CapHuman.
CapHuman encodes identity features and then learns to align them into the latent space.
We introduce a 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner.
arXiv Detail & Related papers (2024-02-01T14:41:59Z)
- Neural Novel Actor: Learning a Generalized Animatable Neural Representation for Human Actors [98.24047528960406]
We propose a new method for learning a generalized animatable neural representation from a sparse set of multi-view imagery of multiple persons.
The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control.
arXiv Detail & Related papers (2022-08-25T07:36:46Z)
- Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on 3DPW, an in-the-wild human video dataset.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)
- Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)
- Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
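The Speech2Video entry above splits the problem into audio-to-skeleton (an RNN) and skeleton-to-video rendering. A hedged sketch of the first stage follows; the feature dimensions, joint count, and LSTM configuration are assumptions, and the gesture dictionary and 3D-skeleton regularization the summary mentions are omitted.

```python
# Hedged sketch: an RNN mapping per-frame audio features to 3D skeleton joints.
import torch
import torch.nn as nn

class AudioToSkeletonRNN(nn.Module):
    def __init__(self, audio_dim=40, hidden=256, n_joints=21):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_joints * 3)    # xyz per joint

    def forward(self, audio_feats):                    # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        joints = self.head(h)                          # (B, T, n_joints * 3)
        return joints.view(*audio_feats.shape[:2], -1, 3)

model = AudioToSkeletonRNN()
mfcc = torch.randn(1, 100, 40)        # e.g. 100 frames of MFCC-like features
skeleton = model(mfcc)                # (1, 100, 21, 3) motion, to be rendered
print(skeleton.shape)
```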
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.