Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking
Heads Generation
- URL: http://arxiv.org/abs/2306.01415v2
- Date: Wed, 26 Jul 2023 14:53:23 GMT
- Title: Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking
Heads Generation
- Authors: Federico Nocentini, Claudio Ferrari, Stefano Berretti
- Abstract summary: This paper presents a novel approach for generating 3D talking heads from raw audio inputs.
Using landmarks in 3D talking head generation offers advantages such as consistency, reliability, and obviating the need for manual annotation.
- Score: 9.242997749920498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel approach for generating 3D talking heads from raw
audio inputs. Our method is grounded in the idea that speech-related movements can
be comprehensively and efficiently described by the motion of a few control
points located on the movable parts of the face, i.e., landmarks. The
underlying musculoskeletal structure then allows us to learn how their motion
influences the geometrical deformations of the whole face. The proposed method
employs two distinct models to this end: the first learns to generate the
motion of a sparse set of landmarks from the given audio; the second
expands this landmark motion into a dense motion field, which is used to
animate a given 3D mesh in a neutral state. Additionally, we introduce a novel
loss function, named Cosine Loss, which minimizes the angle between the
generated motion vectors and the ground-truth ones. Using landmarks in 3D
talking head generation offers advantages such as consistency,
reliability, and obviating the need for manual annotation. Our approach is
designed to be identity-agnostic, enabling high-quality facial animations for
any user without additional data or training.
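The abstract describes a two-stage pipeline (audio to sparse landmark motion, then landmark motion to a dense per-vertex motion field applied to a neutral mesh) and a Cosine Loss that penalizes the angle between generated and ground-truth motion vectors. The exact architectures and loss weighting are not given here; the following is a minimal PyTorch-style sketch under those assumptions, where `cosine_loss`, `S2L`, `L2D`, and all tensor shapes are hypothetical names, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cosine_loss(pred_motion: torch.Tensor, gt_motion: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Penalize the angle between predicted and ground-truth motion vectors.

    Both tensors are assumed to have shape (batch, frames, points, 3), where
    each 3-vector is the displacement of a landmark or mesh vertex. Returns
    1 - cos(theta) averaged over all vectors, so perfectly aligned directions
    give zero loss regardless of vector magnitude.
    """
    cos = F.cosine_similarity(pred_motion, gt_motion, dim=-1, eps=eps)
    return (1.0 - cos).mean()


# Hypothetical two-stage usage, mirroring the abstract's decomposition:
#   1) audio -> motion of a sparse set of landmarks
#   2) sparse landmark motion -> dense motion field applied to a neutral mesh
# `S2L` and `L2D` are placeholder modules; audio_feats: (B, T, F) speech
# features, neutral_mesh: (B, V, 3) resting geometry.
#
# landmark_motion = S2L(audio_feats)                  # (B, T, L, 3)
# dense_motion = L2D(landmark_motion, neutral_mesh)   # (B, T, V, 3)
# animated = neutral_mesh.unsqueeze(1) + dense_motion # per-frame deformed mesh
# loss = cosine_loss(dense_motion, gt_dense_motion)   # plus any reconstruction terms
```

Since a direction-only term like this is invariant to vector length, it would typically be combined with a positional reconstruction term (e.g. an L2 loss on vertex displacements); the abstract does not specify the exact combination used.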
Related papers
- KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: linguistic-based key motion acquisition and cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z) - NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior [5.819784482811377]
We propose a novel method, NeRFFaceSpeech, which can produce high-quality, 3D-aware talking heads.
Our method can craft a 3D-consistent facial feature space corresponding to a single image.
We also introduce LipaintNet, which can fill in missing information in the inner-mouth area.
arXiv Detail & Related papers (2024-05-09T13:14:06Z) - High-Fidelity and Freely Controllable Talking Head Video Generation [31.08828907637289]
We propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression.
We introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion.
We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance.
arXiv Detail & Related papers (2023-04-20T09:02:41Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical
Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - SadTalker: Learning Realistic 3D Motion Coefficients for Stylized
Audio-Driven Single Image Talking Face Animation [33.651156455111916]
We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio.
Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces.
arXiv Detail & Related papers (2022-11-22T11:35:07Z) - Controllable Radiance Fields for Dynamic Face Synthesis [125.48602100893845]
We study how to explicitly control generative model synthesis of face dynamics exhibiting non-rigid motion.
To this end, we propose the Controllable Radiance Field (CoRF).
On head image/video data we show that CoRFs are 3D-aware while enabling editing of identity, viewing directions, and motion.
arXiv Detail & Related papers (2022-10-11T23:17:31Z) - Copy Motion From One to Another: Fake Motion Video Generation [53.676020148034034]
A compelling application of artificial intelligence is to generate a video of a target person performing arbitrary desired motion.
Current methods typically employ GANs with an L2 loss to assess the authenticity of the generated videos.
We propose a theoretically motivated Gromov-Wasserstein loss that facilitates learning the mapping from a pose to a foreground image.
Our method is able to generate realistic target person videos, faithfully copying complex motions from a source person.
arXiv Detail & Related papers (2022-05-03T08:45:22Z) - MeshTalk: 3D Face Animation from Speech using Cross-Modality
Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z) - Learning Speech-driven 3D Conversational Gestures from Video [106.15628979352738]
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures.
Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures.
We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
arXiv Detail & Related papers (2021-02-13T01:05:39Z) - Speech2Video Synthesis with 3D Skeleton Regularization and Expressive
Body Poses [36.00309828380724]
We propose a novel approach to convert given speech audio to a photo-realistic speaking video of a specific person.
We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN).
To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process.
arXiv Detail & Related papers (2020-07-17T19:30:14Z)
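The Speech2Video entry above describes first mapping the audio sequence to 3D skeleton movements with a recurrent network. The architecture is not specified in the summary; below is a minimal, hypothetical sketch of such an audio-to-skeleton RNN, where the class name, feature dimensions, and joint count are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioToSkeletonRNN(nn.Module):
    """Hypothetical audio-to-skeleton regressor: a GRU over per-frame audio
    features that outputs 3D coordinates for a fixed set of skeleton joints."""

    def __init__(self, audio_dim: int = 80, hidden_dim: int = 256, num_joints: int = 25):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        hidden, _ = self.rnn(audio_feats)
        joints = self.head(hidden)                    # (batch, frames, num_joints * 3)
        return joints.view(*joints.shape[:2], self.num_joints, 3)


# Example: 200 frames of 80-dim audio features -> per-frame 3D joint positions
model = AudioToSkeletonRNN()
skeleton = model(torch.randn(1, 200, 80))             # shape (1, 200, 25, 3)
```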