Towards the generation of synchronized and believable non-verbal facial
behaviors of a talking virtual agent
- URL: http://arxiv.org/abs/2311.12804v1
- Date: Fri, 15 Sep 2023 07:25:36 GMT
- Title: Towards the generation of synchronized and believable non-verbal facial
behaviors of a talking virtual agent
- Authors: Alice Delbosc (TALEP, LIS, AMU), Magalie Ochs (LIS, AMU, TALEP),
Nicolas Sabouret (LISN), Brian Ravenet (LISN), Stéphane Ayache (AMU, LIS,
QARMA)
- Abstract summary: This paper introduces a new model to generate rhythmically relevant non-verbal facial behaviors for virtual agents while they speak.
We find that training the model with two different sets of data, instead of one, did not necessarily improve its performance.
We also show that employing an adversarial model, in which fabricated fake examples are introduced during the training phase, increases the perception of synchronization with speech.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a new model to generate rhythmically relevant
non-verbal facial behaviors for virtual agents while they speak. The model
demonstrates perceived performance comparable to behaviors directly extracted
from the data and replayed on a virtual agent, in terms of synchronization with
speech and believability. Interestingly, we found that training the model with
two different sets of data, instead of one, did not necessarily improve its
performance. The expressiveness of the people in the dataset and the shooting
conditions are key elements. We also show that employing an adversarial model,
in which fabricated fake examples are introduced during the training phase,
increases the perception of synchronization with speech. A collection of videos
demonstrating the results and code can be accessed at:
https://github.com/aldelb/non_verbal_facial_animation.
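To make the adversarial training idea mentioned in the abstract more concrete, below is a minimal, hedged sketch of one common way such "fabricated fake examples" can be realized: a synchronization discriminator is shown real speech/motion pairs, generator outputs, and fabricated fakes built by pairing real speech with real motion from other clips. All module names, feature dimensions, and the mismatching strategy are illustrative assumptions, not the authors' actual implementation (their code is linked above).

```python
# Hedged sketch only: a synchronization discriminator trained with real pairs,
# generator outputs, and "fabricated" fakes (mismatched speech/motion pairs).
# Module names, dimensions, and the mismatching strategy are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncDiscriminator(nn.Module):
    """Scores whether a (speech, facial-motion) feature pair looks synchronized."""
    def __init__(self, audio_dim=128, motion_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + motion_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit per time step
        )

    def forward(self, audio_feat, motion_feat):
        # audio_feat: (B, T, audio_dim), motion_feat: (B, T, motion_dim)
        return self.net(torch.cat([audio_feat, motion_feat], dim=-1))

def discriminator_step(disc, generator, audio, real_motion, opt_d):
    """One discriminator update using real, generated, and fabricated pairs."""
    with torch.no_grad():
        gen_motion = generator(audio)  # motion produced by the generator
    # Fabricated fake: real motion paired with the "wrong" speech (batch shuffle).
    mismatched = real_motion[torch.randperm(real_motion.size(0))]

    real_logit = disc(audio, real_motion)
    gen_logit = disc(audio, gen_motion)
    mis_logit = disc(audio, mismatched)

    loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
            + F.binary_cross_entropy_with_logits(gen_logit, torch.zeros_like(gen_logit))
            + F.binary_cross_entropy_with_logits(mis_logit, torch.zeros_like(mis_logit)))
    opt_d.zero_grad()
    loss.backward()
    opt_d.step()
    return loss.item()
```

The generator's adversarial update (training it to fool this discriminator), along with any reconstruction losses, is omitted here for brevity; the authors' repository above documents their actual training setup.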
Related papers
- Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics [14.290468730787772]
We introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes.
Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization (a minimal sketch of one way such a perceptual loss can be formulated appears after this list).
arXiv Detail & Related papers (2025-03-26T08:18:57Z)
- InterAct: Capture and Modelling of Realistic, Expressive and Interactive Activities between Two Persons in Daily Scenarios [12.300105542672163]
We capture 241 motion sequences where two persons perform a realistic scenario over the whole sequence.
The audios, body motions, and facial expressions of both persons are all captured in our dataset.
We also demonstrate the first diffusion model based approach that directly estimates the interactive motions between two persons from their audios alone.
arXiv Detail & Related papers (2024-05-19T22:35:02Z)
- AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding [24.486705010561067]
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
arXiv Detail & Related papers (2024-05-06T02:32:41Z)
- Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z)
- DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components.
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
- GestSync: Determining who is speaking without a talking head [67.75387744442727]
We introduce Gesture-Sync: determining if a person's gestures are correlated with their speech or not.
In comparison to Lip-Sync, Gesture-Sync is far more challenging as there is a far looser relationship between the voice and body movement.
We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset.
arXiv Detail & Related papers (2023-10-08T22:48:30Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- Neural Face Models for Example-Based Visual Speech Synthesis [2.2817442144155207]
We present a marker-less approach for facial motion capture based on multi-view video.
We learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure.
arXiv Detail & Related papers (2020-09-22T07:35:33Z)
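As referenced in the Perceptually Accurate 3D Talking Head Generation entry above, a speech-mesh synchronized representation and a perceptual loss built on it can be sketched, under assumptions, as two encoders trained contrastively into a shared embedding space, with generated meshes then pulled toward the embedding of the driving speech. The encoders, dimensions, and losses below are illustrative and are not that paper's actual architecture.

```python
# Hedged sketch of a speech-mesh synchronized embedding space and a perceptual
# loss derived from it. Architectures, vertex count, and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, audio_dim=80, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, embed_dim, batch_first=True)

    def forward(self, audio):              # audio: (B, T, audio_dim)
        _, h = self.rnn(audio)
        return F.normalize(h[-1], dim=-1)  # (B, embed_dim), unit-norm

class MeshEncoder(nn.Module):
    def __init__(self, n_vertices=5023, embed_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_vertices * 3, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))

    def forward(self, mesh_seq):           # mesh_seq: (B, T, n_vertices*3)
        pooled = mesh_seq.mean(dim=1)      # mean-pool over time for simplicity
        return F.normalize(self.mlp(pooled), dim=-1)

def contrastive_sync_loss(speech_emb, mesh_emb, temperature=0.07):
    """InfoNCE-style loss: matching (speech, mesh) pairs are positives,
    all other pairs in the batch are negatives."""
    logits = speech_emb @ mesh_emb.t() / temperature     # (B, B)
    targets = torch.arange(speech_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def perceptual_sync_loss(speech_emb, generated_mesh_emb):
    """Perceptual loss for a talking-head generator: pull the generated mesh
    toward the embedding of the driving speech in the shared space."""
    return (1.0 - F.cosine_similarity(speech_emb, generated_mesh_emb, dim=-1)).mean()
```

In such a setup, the two encoders would first be trained with contrastive_sync_loss on aligned speech/mesh data; perceptual_sync_loss can then be added to a talking-head generator's objective alongside its usual reconstruction terms.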
This list is automatically generated from the titles and abstracts of the papers on this site.