DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation
- URL: http://arxiv.org/abs/2203.07931v2
- Date: Sat, 12 Aug 2023 14:45:58 GMT
- Title: DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation
- Authors: Yichao Yan, Zanwei Zhou, Zi Wang, Jingnan Gao, Xiaokang Yang
- Abstract summary: Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods have focused on single-person talking head generation.
We propose a novel unified framework based on the neural radiance field (NeRF).
- Score: 54.84137342837465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversation is an essential component of virtual avatar activities in the
metaverse. With the development of natural language processing, textual and
vocal conversation generation has achieved a significant breakthrough. However,
face-to-face conversations account for the vast majority of daily
conversations, while most existing methods have focused on single-person talking
head generation. In this work, we take a step further and consider generating
realistic face-to-face conversation videos. Conversation generation is more
challenging than single-person talking head generation, since it not only
requires generating photo-realistic individual talking heads but also demands
that the listener respond to the speaker. In this paper, we propose a novel
unified framework based on the neural radiance field (NeRF) to address this task.
Specifically, we model both the speaker and listener with a NeRF framework,
with different conditions to control individual expressions. The speaker is
driven by the audio signal, while the response of the listener depends on both
visual and acoustic information. In this way, face-to-face conversation videos
are generated between human avatars, with all the interlocutors modeled within
the same network. Moreover, to facilitate future research on this task, we
collect a new human conversation dataset containing 34 video clips.
Quantitative and qualitative experiments evaluate our method in different
aspects, e.g., image quality, pose sequence trend, and naturalness of the
rendered videos. Experimental results demonstrate that the avatars in the
resulting videos are able to perform a realistic conversation, and maintain
individual styles. All the code, data, and models will be made publicly
available.
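For intuition, here is a minimal sketch of the conditioning scheme the abstract describes: a single NeRF-style network renders both interlocutors, with the speaker branch conditioned on an audio feature and the listener branch on fused audio and visual features. This is not the authors' released code; all module names, dimensions, and feature extractors are hypothetical placeholders.

# Minimal sketch (hypothetical, not the authors' implementation) of a NeRF
# MLP that models both interlocutors with role-specific conditioning.
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    def __init__(self, pos_dim=63, audio_dim=32, visual_dim=32, hidden=256):
        super().__init__()
        # Speaker expressions are driven by audio alone; the listener's
        # response depends on both acoustic and visual information.
        self.speaker_cond = nn.Linear(audio_dim, hidden)
        self.listener_cond = nn.Linear(audio_dim + visual_dim, hidden)
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # RGB radiance

    def forward(self, x, audio, visual=None, role="speaker"):
        # x: positionally encoded 3D sample points, shape (N, pos_dim)
        if role == "speaker":
            cond = self.speaker_cond(audio)
        else:  # listener: fuse acoustic and visual cues from the speaker
            cond = self.listener_cond(torch.cat([audio, visual], dim=-1))
        h = self.trunk(torch.cat([x, cond.expand(x.shape[0], -1)], dim=-1))
        return self.density_head(h), torch.sigmoid(self.color_head(h))

# Toy usage: query densities/colors for 1024 ray samples per role.
model = ConditionalNeRF()
pts = torch.randn(1024, 63)      # encoded sample points along camera rays
audio_feat = torch.randn(1, 32)  # e.g., from a speech encoder
visual_feat = torch.randn(1, 32) # e.g., speaker face features
sigma_s, rgb_s = model(pts, audio_feat, role="speaker")
sigma_l, rgb_l = model(pts, audio_feat, visual_feat, role="listener")

In this reading, "all interlocutors modeled within the same network" amounts to sharing the trunk while switching the condition branch per role.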
Related papers
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z) - Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z) - Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which is built around a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion (a toy sampler is sketched after this list).
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - Write-a-speaker: Text-based Emotional and Rhythmic Talking-head
Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm produces high-quality photo-realistic talking-head videos with various facial expressions and head motions that follow the speech rhythm.
arXiv Detail & Related papers (2021-04-16T09:44:12Z) - Audio-driven Talking Face Video Generation with Learning-based
Personalized Head Pose [67.31838207805573]
We propose a deep neural network model that takes an audio signal A of a source person and a short video V of a target person as input.
It outputs a synthesized high-quality talking face video with a personalized head pose.
Our method can generate high-quality talking face videos with more distinctive head movement effects than state-of-the-art methods.
arXiv Detail & Related papers (2020-02-24T10:02:10Z)