MFR-Net: Multi-faceted Responsive Listening Head Generation via
Denoising Diffusion Model
- URL: http://arxiv.org/abs/2308.16635v1
- Date: Thu, 31 Aug 2023 11:10:28 GMT
- Title: MFR-Net: Multi-faceted Responsive Listening Head Generation via
Denoising Diffusion Model
- Authors: Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong
Han
- Abstract summary: Responsive listening head generation is an important task that aims to model face-to-face communication scenarios.
We propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net).
- Score: 14.220727407255966
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Face-to-face communication is a common scenario including roles of speakers
and listeners. Most existing research methods focus on producing speaker
videos, while the generation of listener heads remains largely overlooked.
Responsive listening head generation is an important task that aims to model
face-to-face communication scenarios by generating a listener head video given
a speaker video and a listener head image. An ideal responsive listening
video should respond to the speaker by expressing attitudes or viewpoints,
while maintaining diversity in interaction patterns and accuracy in
listener identity information. To achieve this goal, we propose the
\textbf{M}ulti-\textbf{F}aceted \textbf{R}esponsive Listening Head Generation
Network (MFR-Net). Specifically, MFR-Net employs a probabilistic denoising
diffusion model to predict diverse head pose and expression features. To
produce multi-faceted responses to the speaker video while accurately
preserving listener identity, we design the Feature Aggregation
Module to boost listener identity features and fuse them with other
speaker-related features. Finally, a renderer finetuned with identity
consistency loss produces the final listening head videos. Our extensive
experiments demonstrate that MFR-Net achieves multi-faceted responses not
only in diversity and listener identity preservation but also in attitude
and viewpoint expression.
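
The abstract describes three components: a denoising diffusion model that predicts pose and expression features, a Feature Aggregation Module that fuses listener identity with speaker-related features, and a renderer finetuned with an identity consistency loss. The following PyTorch sketch shows how such a pipeline could fit together; the module names, feature dimensions, cross-attention fusion, and cosine-similarity loss form are all illustrative assumptions rather than the paper's released implementation.

```python
# A minimal, illustrative sketch of an MFR-Net-style pipeline. All names,
# dimensions, and design choices here are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 64    # assumed size of the pose+expression feature vector
COND_DIM = 128   # assumed size of the speaker/identity conditioning

class DenoiseNet(nn.Module):
    """Epsilon-prediction network of a DDPM over pose/expression features,
    conditioned on fused speaker-related features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + COND_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, x_t, t, cond):
        # t is normalized to [0, 1] and appended as a scalar condition
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))

class FeatureAggregation(nn.Module):
    """Assumed stand-in for the paper's Feature Aggregation Module: boosts
    listener identity features, then fuses them with speaker features via
    cross-attention with a residual connection."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(COND_DIM, num_heads=4, batch_first=True)
        self.boost = nn.Linear(COND_DIM, COND_DIM)

    def forward(self, listener_id, speaker_feats):
        # listener_id: (B, 1, COND_DIM); speaker_feats: (B, T, COND_DIM)
        q = self.boost(listener_id)                # emphasize identity
        fused, _ = self.attn(q, speaker_feats, speaker_feats)
        return (q + fused).squeeze(1)              # residual fusion

@torch.no_grad()
def ddpm_sample(model, cond, steps=50):
    """Ancestral DDPM sampling of a diverse pose/expression feature."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.size(0), FEAT_DIM)       # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((cond.size(0),), i / steps)
        eps = model(x, t, cond)
        # standard DDPM posterior-mean update for x_{i-1} given x_i
        x = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x

def identity_consistency_loss(id_encoder, generated, reference):
    """Assumed form of the identity consistency loss used to finetune the
    renderer: 1 - cosine similarity between identity embeddings of the
    generated and reference frames (id_encoder is a pretrained face
    recognition network, also an assumption)."""
    e_g = F.normalize(id_encoder(generated), dim=-1)
    e_r = F.normalize(id_encoder(reference), dim=-1)
    return (1.0 - (e_g * e_r).sum(-1)).mean()
```

In this sketch, response diversity comes from the stochastic noise injected at each step of `ddpm_sample`, while identity accuracy is encouraged twice: by the identity-boosted fusion in `FeatureAggregation` and by finetuning the renderer with `identity_consistency_loss`.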
Related papers
- Leveraging WaveNet for Dynamic Listening Head Modeling from Speech [11.016004057765185]
The creation of listener facial responses aims to simulate interactive communication feedback from a listener during a face-to-face conversation.
Our approach focuses on capturing the subtle nuances of listener feedback, ensuring the preservation of individual listener identity.
arXiv Detail & Related papers (2024-09-08T13:19:22Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
The ELP model can not only generate natural and diverse responses to a given speaker by sampling from the learned distribution, but also generate controllable responses with a predetermined attitude (a toy sketch of the motion-codeword idea appears after this list).
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline [6.9329709955764045]
This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
arXiv Detail & Related papers (2023-07-19T08:16:34Z)
- Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks [21.864200803678003]
We propose a talking head generation model consisting of a Memory-Sharing Emotion Feature extractor (MSEF) and an Attention-Augmented Translator based on U-Net (AATU).
MSEF can extract implicit emotional auxiliary features from audio to estimate more accurate emotional face landmarks.
AATU acts as a translator between the estimated landmarks and the photo-realistic video frames.
arXiv Detail & Related papers (2023-06-06T11:31:29Z)
- GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis [62.297513028116576]
GeneFace is a general and high-fidelity NeRF-based talking face generation method.
A head-aware torso-NeRF is proposed to eliminate the head-torso separation problem.
arXiv Detail & Related papers (2023-01-31T05:56:06Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focused on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head, with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, this task introduces more modalities, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
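
The Emotional Listener Portrait entry above describes composing fine-grained facial motion from discrete motion-codewords. Below is a toy vector-quantization sketch of that idea; the codebook size, feature dimension, and straight-through estimator are illustrative assumptions, not ELP's actual design.

```python
# Toy vector-quantization of continuous facial-motion features into
# discrete codewords, in the spirit of ELP's description above.
import torch
import torch.nn as nn

class MotionCodebook(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, motion_feat):
        # motion_feat: (B, T, code_dim) continuous facial-motion features
        flat = motion_feat.reshape(-1, motion_feat.size(-1))   # (B*T, D)
        dists = torch.cdist(flat, self.codes.weight)           # (B*T, N)
        idx = dists.argmin(-1).view(motion_feat.shape[:-1])    # (B, T)
        quantized = self.codes(idx)                            # nearest codewords
        # Straight-through estimator: copy gradients back to the encoder
        quantized = motion_feat + (quantized - motion_feat).detach()
        return quantized, idx
```

Sampling different codeword sequences from a learned prior would then yield diverse yet plausible listener motions, matching the entry's claim of both diversity and controllable attitude.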