From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- URL: http://arxiv.org/abs/2401.01885v1
- Date: Wed, 3 Jan 2024 18:55:16 GMT
- Title: From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor
Darrell, Angjoo Kanazawa, Alexander Richard
- Abstract summary: Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
- Score: 107.88375243135579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a framework for generating full-bodied photorealistic avatars that
gesture according to the conversational dynamics of a dyadic interaction. Given
speech audio, we output multiple possibilities of gestural motion for an
individual, including face, body, and hands. The key behind our method is in
combining the benefits of sample diversity from vector quantization with the
high-frequency details obtained through diffusion to generate more dynamic,
expressive motion. We visualize the generated motion using highly
photorealistic avatars that can express crucial nuances in gestures (e.g.
sneers and smirks). To facilitate this line of research, we introduce a
first-of-its-kind multi-view conversational dataset that allows for
photorealistic reconstruction. Experiments show our model generates appropriate
and diverse gestures, outperforming both diffusion- and VQ-only methods.
Furthermore, our perceptual evaluation highlights the importance of
photorealism (vs. meshes) in accurately assessing subtle motion details in
conversational gestures. Code and dataset available online.
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
- Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents [16.544688997764293]
Our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions.
These descriptions are processed by our task-agnostic driving engine into continuous motion embeddings.
Our framework adapts to a variety of non-verbal avatar interactions, both monadic and dyadic.
arXiv Detail & Related papers (2023-11-29T09:13:00Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed using hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction [9.747648609960185]
We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face.
In particular, for telepresence applications in AR or VR, a faithful reproduction of the appearance, including novel viewpoints and head poses, is required.
arXiv Detail & Related papers (2020-12-05T16:01:16Z)