Responsive Listening Head Generation: A Benchmark Dataset and Baseline
- URL: http://arxiv.org/abs/2112.13548v1
- Date: Mon, 27 Dec 2021 07:18:50 GMT
- Title: Responsive Listening Head Generation: A Benchmark Dataset and Baseline
- Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Tiejun Zhao, Tao Mei
- Abstract summary: We define the responsive listening head generation task as the synthesis of non-verbal listener head motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
- Score: 58.168958284290156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Responsive listening during face-to-face conversations is a critical element
of social interaction and is well established in psychological research.
Through non-verbal signals that respond in real time to the speaker's words,
intonation, or behavior, listeners show how engaged they are in the dialogue. In
this work, we build the Responsive Listener Dataset (RLD), a conversation video
corpus collected from public resources, featuring 67 speakers and 76 listeners
with three different attitudes. We define the responsive listening head
generation task as the synthesis of non-verbal listener head motions and
expressions reacting to multiple inputs, including the audio and visual
signals of the speaker. Unlike speech-driven gesture or talking head generation,
we introduce more modalities in this task, hoping to benefit several research
fields, including human-to-human interaction, video-to-video translation,
cross-modal understanding, and generation. Furthermore, we release an
attitude-conditioned listening head generation baseline. Project page:
\url{https://project.mhzhou.com/rld}.
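As a rough illustration of what such an attitude-conditioned baseline could look like, the sketch below maps per-frame speaker audio/visual features and an attitude label to a listener motion sequence. All module names, feature dimensions, and the GRU conditioning are assumptions for illustration, not the released baseline.

```python
# A minimal sketch (not the released baseline): map per-frame speaker
# audio/visual features plus an attitude label to listener head motion.
import torch
import torch.nn as nn

class ListeningHeadBaseline(nn.Module):
    """Speaker audio/visual features + attitude -> listener motion sequence
    (e.g. per-frame 3DMM expression/pose coefficients)."""

    def __init__(self, audio_dim=128, visual_dim=256, motion_dim=64,
                 hidden_dim=512, num_attitudes=3):
        super().__init__()
        # One embedding per attitude; the RLD corpus annotates three.
        self.attitude_emb = nn.Embedding(num_attitudes, hidden_dim)
        self.fuse = nn.Linear(audio_dim + visual_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, motion_dim)

    def forward(self, speaker_audio, speaker_visual, attitude):
        # speaker_audio:  (B, T, audio_dim)  per-frame audio features
        # speaker_visual: (B, T, visual_dim) per-frame visual features
        # attitude:       (B,)               integer attitude label
        x = torch.tanh(self.fuse(torch.cat([speaker_audio, speaker_visual], dim=-1)))
        h0 = self.attitude_emb(attitude).unsqueeze(0)  # condition the initial RNN state
        out, _ = self.rnn(x, h0)
        return self.head(out)  # (B, T, motion_dim) listener motion

model = ListeningHeadBaseline()
motion = model(torch.randn(2, 100, 128), torch.randn(2, 100, 256),
               torch.tensor([0, 2]))
print(motion.shape)  # torch.Size([2, 100, 64])
```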
Related papers
- Emotional Listener Portrait: Realistic Listener Motion Simulation in
Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model not only automatically generates natural and diverse responses to a given speaker by sampling from the learned distribution, but can also generate controllable responses with a predetermined attitude.
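A minimal sketch of the motion-codeword idea described above: a categorical distribution over a learned codebook is sampled per frame, so repeated sampling yields diverse responses. The names, dimensions, and the per-frame categorical are assumptions, not ELP's actual implementation.

```python
# Illustrative sketch of composing facial motion from discrete codewords;
# names and dimensions are assumptions, not ELP's code.
import torch
import torch.nn as nn

class MotionCodewordSampler(nn.Module):
    def __init__(self, num_codewords=256, code_dim=32, ctx_dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codewords, code_dim)  # learned motion-codewords
        self.to_logits = nn.Linear(ctx_dim, num_codewords)     # context -> categorical

    def forward(self, speaker_context):
        # speaker_context: (B, T, ctx_dim) encoded speaker audio/visual signal
        dist = torch.distributions.Categorical(logits=self.to_logits(speaker_context))
        idx = dist.sample()        # (B, T): one codeword index per frame
        return self.codebook(idx)  # (B, T, code_dim) motion components

# Sampling twice from the same context yields diverse, non-deterministic responses.
sampler = MotionCodewordSampler()
ctx = torch.randn(1, 50, 128)
m1, m2 = sampler(ctx), sampler(ctx)  # generally differ
```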
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model [14.220727407255966]
Responsive listening head generation is an important task that aims to model face-to-face communication scenarios.
We propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net).
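Since the summary names denoising diffusion as the generator, the sketch below shows a generic DDPM-style reverse step for listener motion; the noise schedule, shapes, and the textbook update rule are assumptions, not MFR-Net specifics.

```python
# Generic DDPM-style reverse step (textbook form, not MFR-Net specifics):
# listener motion is sampled by iteratively denoising from pure noise.
import torch

def ddpm_reverse_step(x_t, eps_pred, t, betas):
    """One reverse step: remove the predicted noise eps_pred from x_t."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.prod(1.0 - betas[: t + 1])  # cumulative product up to t
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(beta_t) * noise

# Iterating t = T-1 .. 0, with eps_pred produced by a network conditioned on
# the speaker's audio/video, would yield one listener-motion sample per run.
betas = torch.linspace(1e-4, 0.02, 1000)
x = torch.randn(1, 64)             # start from pure noise
eps = torch.randn_like(x)          # stand-in for a network's noise prediction
x = ddpm_reverse_step(x, eps, t=999, betas=betas)
```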
arXiv Detail & Related papers (2023-08-31T11:10:28Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response as a sequence of facial gestures quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
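A minimal sketch of the autoregressive decoding described above: a token-level model, conditioned on pooled speaker features, emits gesture tokens one at a time. The vocabulary size, the GRU backbone, and greedy decoding are illustrative assumptions, not the paper's confirmed architecture.

```python
# Minimal sketch of autoregressive decoding over VQ-VAE gesture tokens;
# names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

NUM_TOKENS, DIM, BOS = 512, 256, 512  # BOS uses the extra embedding slot

class ListenerTokenLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(NUM_TOKENS + 1, DIM)  # +1 for the BOS token
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, NUM_TOKENS)

    @torch.no_grad()
    def generate(self, speaker_feat, steps=30):
        # speaker_feat: (1, DIM) pooled features of the speaker's words (conditioning)
        h = speaker_feat.unsqueeze(0)      # seed the hidden state with the condition
        tok = torch.full((1, 1), BOS)      # begin-of-sequence token
        seq = []
        for _ in range(steps):
            out, h = self.rnn(self.tok(tok), h)
            tok = self.out(out[:, -1]).argmax(-1, keepdim=True)  # greedy next token
            seq.append(tok)
        # A VQ-VAE decoder (not shown) would map these tokens back to facial gestures.
        return torch.cat(seq, dim=1)       # (1, steps) gesture-token indices

tokens = ListenerTokenLM().generate(torch.randn(1, 256))
```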
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline [6.9329709955764045]
This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
arXiv Detail & Related papers (2023-07-19T08:16:34Z)
- Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversational head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
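One plausible way to fuse the two conditioning streams is cross-attention from phoneme features to listener frames, sketched below; this fusion choice and all dimensions are assumptions, not the paper's confirmed design.

```python
# Hypothetical fusion for VA-TTS: phoneme features attend to listener visual
# feedback via cross-attention; all names and dimensions are assumptions.
import torch
import torch.nn as nn

class VisualAwareFusion(nn.Module):
    def __init__(self, phone_dim=256, vis_dim=128, dim=256):
        super().__init__()
        self.q = nn.Linear(phone_dim, dim)   # queries from phoneme features
        self.kv = nn.Linear(vis_dim, dim)    # keys/values from listener frames
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, phonemes, listener_frames):
        # phonemes:        (B, Tp, phone_dim) linguistic features
        # listener_frames: (B, Tv, vis_dim)   per-frame listener visual signals
        q, kv = self.q(phonemes), self.kv(listener_frames)
        fused, _ = self.attn(q, kv, kv)  # each phoneme attends to listener feedback
        return fused + q                 # residual: speech features now visual-aware

fusion = VisualAwareFusion()
out = fusion(torch.randn(2, 40, 256), torch.randn(2, 120, 128))
print(out.shape)  # torch.Size([2, 40, 256])
```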
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview [41.789773897391605]
We have developed an intelligent conversational android ERICA.
We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating.
ERICA has been evaluated with 40 elderly participants, each engaged in a conversation of 5-7 minutes without a conversational breakdown.
arXiv Detail & Related papers (2021-05-02T06:37:23Z)