From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
 - URL: http://arxiv.org/abs/2401.01885v1
 - Date: Wed, 3 Jan 2024 18:55:16 GMT
 - Title: From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
 - Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor
  Darrell, Angjoo Kanazawa, Alexander Richard
 - Abstract summary: Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
 Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
 - Score: 107.88375243135579
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   We present a framework for generating full-bodied photorealistic avatars that
gesture according to the conversational dynamics of a dyadic interaction. Given
speech audio, we output multiple possibilities of gestural motion for an
individual, including face, body, and hands. The key behind our method is in
combining the benefits of sample diversity from vector quantization with the
high-frequency details obtained through diffusion to generate more dynamic,
expressive motion. We visualize the generated motion using highly
photorealistic avatars that can express crucial nuances in gestures (e.g.
sneers and smirks). To facilitate this line of research, we introduce a
first-of-its-kind multi-view conversational dataset that allows for
photorealistic reconstruction. Experiments show our model generates appropriate
and diverse gestures, outperforming both diffusion- and VQ-only methods.
Furthermore, our perceptual evaluation highlights the importance of
photorealism (vs. meshes) in accurately assessing subtle motion details in
conversational gestures. Code and dataset available online.
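The abstract's central idea, pairing the sample diversity of a vector-quantized guide with the high-frequency detail of a diffusion model, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the module names, feature dimensions, sampling strategy, and simplified denoising loop are assumptions for exposition, not the authors' released code (which, as noted above, is available online).

# Minimal illustrative sketch (not the authors' implementation): a coarse guide
# pose sampled from a learned codebook conditions a diffusion-style denoiser
# that adds high-frequency motion detail. All names and dimensions are assumed.
import torch
import torch.nn as nn

AUDIO_DIM, POSE_DIM, CODEBOOK_SIZE, CODE_DIM = 128, 104, 512, 64

class VQGuide(nn.Module):
    """Samples coarse guide poses from a learned codebook, conditioned on audio."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)
        self.audio_to_logits = nn.Sequential(
            nn.Linear(AUDIO_DIM, 256), nn.ReLU(), nn.Linear(256, CODEBOOK_SIZE))
        self.decode = nn.Linear(CODE_DIM, POSE_DIM)

    def sample(self, audio):                          # audio: (B, T, AUDIO_DIM)
        logits = self.audio_to_logits(audio)          # (B, T, CODEBOOK_SIZE)
        idx = torch.distributions.Categorical(logits=logits).sample()
        return self.decode(self.codebook(idx))        # coarse guide: (B, T, POSE_DIM)

class MotionDenoiser(nn.Module):
    """Predicts the noise in a pose sequence, conditioned on audio and the VQ guide."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * POSE_DIM + AUDIO_DIM + 1, 512), nn.ReLU(),
            nn.Linear(512, POSE_DIM))

    def forward(self, noisy_pose, guide, audio, t):
        t_feat = t.float().view(-1, 1, 1).expand(-1, noisy_pose.shape[1], 1)
        return self.net(torch.cat([noisy_pose, guide, audio, t_feat], dim=-1))

if __name__ == "__main__":
    B, T = 2, 32                                      # two clips, 32 frames each
    audio = torch.randn(B, T, AUDIO_DIM)              # stand-in for audio features
    guide = VQGuide().sample(audio)                   # diverse coarse samples via VQ
    denoiser = MotionDenoiser()
    pose = torch.randn(B, T, POSE_DIM)                # start reverse diffusion from noise
    for step in reversed(range(10)):                  # toy reverse loop, not full DDPM math
        t = torch.full((B,), step)
        pose = pose - 0.1 * denoiser(pose, guide, audio, t)
    print("generated motion:", pose.shape)            # (B, T, POSE_DIM)

The point the sketch tries to capture is the split of responsibilities described in the abstract: the categorical codebook sample provides diverse coarse guide motion per audio input, while the conditional denoiser fills in the dynamic, high-frequency detail on top of that guide.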
 
       
      
        Related papers
        - Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv  Detail & Related papers  (2025-06-27T18:09:49Z)
- X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image.
It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv  Detail & Related papers  (2025-01-17T08:10:53Z)
- GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv  Detail & Related papers  (2024-11-27T18:54:08Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv  Detail & Related papers  (2024-06-26T04:53:11Z)
- Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv  Detail & Related papers  (2024-06-13T04:33:20Z)
- Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv  Detail & Related papers  (2024-03-14T03:21:33Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv  Detail & Related papers  (2023-12-13T19:01:07Z)
- AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents [16.544688997764293]
Our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions.
These descriptions are processed by our task-agnostic driving engine into continuous motion embeddings.
Our framework adapts to a variety of non-verbal avatar interactions, both monadic and dyadic.
arXiv  Detail & Related papers  (2023-11-29T09:13:00Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed using hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv  Detail & Related papers  (2023-02-24T09:36:31Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv  Detail & Related papers  (2022-07-20T09:28:16Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv  Detail & Related papers  (2022-04-18T17:58:04Z)
- Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction [9.747648609960185]
We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face.
In particular, telepresence applications in AR or VR require a faithful reproduction of the appearance, including novel viewpoints and head poses.
arXiv  Detail & Related papers  (2020-12-05T16:01:16Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     