Hierarchical Semantic Perceptual Listener Head Video Generation: A
High-performance Pipeline
- URL: http://arxiv.org/abs/2307.09821v1
- Date: Wed, 19 Jul 2023 08:16:34 GMT
- Title: Hierarchical Semantic Perceptual Listener Head Video Generation: A
High-performance Pipeline
- Authors: Zhigang Chang, Weitai Hu, Qing Yang, Shibao Zheng
- Abstract summary: This paper is a technical report of the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
- Score: 6.9329709955764045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In dyadic speaker-listener interactions, the listener's head reactions,
together with the speaker's head movements, constitute an important non-verbal
semantic expression. The listener head generation task aims to synthesize
responsive listener head videos from the speaker's audio and reference images
of the listener. Compared with talking-head generation, it is more challenging
to capture the correlation cues from the speaker's audio and visual
information. Following the ViCo baseline scheme, we propose a high-performance
solution by enhancing the hierarchical semantic extraction capability of the
audio encoder module and improving the decoder, renderer, and post-processing
modules. Our solution achieved first place on the official leaderboard for the
listening head generation track. This paper is a technical report for the
ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023
conference.
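The abstract describes the pipeline (hierarchical audio encoder, decoder, renderer, post-processing) only at a high level. The following is a minimal sketch of such a listener-head generation pipeline, assuming a ViCo-style audio-encoder / listener-decoder / renderer structure; all module names, feature dimensions, and the coefficient-based output are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of a listener-head generation pipeline
# (audio encoder -> listener decoder -> renderer), assuming a ViCo-style
# design. Module names, feature sizes, and the coefficient output are
# hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn as nn


class HierarchicalAudioEncoder(nn.Module):
    """Extracts frame-level and utterance-level (hierarchical) audio features."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        # Short-range (phoneme-level) cues from local convolutions.
        self.local = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Long-range (semantic) context from a recurrent layer.
        self.context = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time)
        local = self.local(mel).transpose(1, 2)        # (batch, time, hidden)
        context, _ = self.context(local)               # (batch, time, hidden)
        return torch.cat([local, context], dim=-1)     # fused hierarchical features


class ListenerDecoder(nn.Module):
    """Maps fused audio features plus a listener identity code to per-frame
    head-motion / expression coefficients."""

    def __init__(self, feat_dim: int = 512, id_dim: int = 128, coeff_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + id_dim, 256, batch_first=True)
        self.head = nn.Linear(256, coeff_dim)

    def forward(self, audio_feat: torch.Tensor, listener_id: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, time, feat_dim), listener_id: (batch, id_dim)
        id_seq = listener_id.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        h, _ = self.rnn(torch.cat([audio_feat, id_seq], dim=-1))
        return self.head(h)                            # (batch, time, coeff_dim)


if __name__ == "__main__":
    encoder, decoder = HierarchicalAudioEncoder(), ListenerDecoder()
    mel = torch.randn(2, 80, 100)                      # 2 clips, 100 audio frames
    listener_id = torch.randn(2, 128)                  # embedding of the reference image
    coeffs = decoder(encoder(mel), listener_id)
    print(coeffs.shape)                                # torch.Size([2, 100, 64])
```

In a full system, a renderer (not sketched here) would turn the predicted coefficients and the listener's reference image into video frames, followed by post-processing of the rendered output.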
Related papers
- A Comparative Study of Perceptual Quality Metrics for Audio-driven
Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z) - MFR-Net: Multi-faceted Responsive Listening Head Generation via
Denoising Diffusion Model [14.220727407255966]
Responsive listening head generation is an important task that aims to model face-to-face communication scenarios.
We propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net).
arXiv Detail & Related papers (2023-08-31T11:10:28Z) - Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z) - Modeling Speaker-Listener Interaction for Backchannel Prediction [24.52345279975304]
Backchanneling theories emphasize the active and continuous role of the listener in the course of a conversation.
We propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech.
Our experimental results on the Switchboard and GECO datasets reveal that in almost all tested scenarios the speaker or listener behavior embeddings help the model make more accurate backchannel predictions.
arXiv Detail & Related papers (2023-04-10T09:22:06Z) - DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven
Portraits Animation [78.08004432704826]
We model talking head generation as an audio-driven, temporally coherent denoising process (DiffTalk); a minimal sketch of such an audio-conditioned denoising loop is given after this list.
In this paper, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis.
Our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost.
arXiv Detail & Related papers (2023-01-10T05:11:25Z) - Perceptual Conversational Head Generation with Regularized Driver and
Enhanced Renderer [4.201920674650052]
Our solution focuses on training a generalized audio-to-head driver using regularization and assembling a renderer with high visual quality.
We took first place in the listening head generation track and second place in the talking head generation track in the official ranking.
arXiv Detail & Related papers (2022-06-26T10:12:59Z) - DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video
Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focus on single-person talking head generation.
We propose a novel unified framework based on neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as synthesizing a non-verbal head, with motions and expressions that react to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - The Right to Talk: An Audio-Visual Transformer Approach [27.71444773878775]
This work introduces a new Audio-Visual Transformer approach to the problem of localizing and highlighting the main speaker in both the audio and visual channels of a multi-speaker conversation video in the wild.
To the best of our knowledge, it is one of the first studies that is able to automatically localize and highlight the main speaker in both visual and audio channels in multi-speaker conversation videos.
arXiv Detail & Related papers (2021-08-06T18:04:24Z) - AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
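Two of the related papers above (MFR-Net and DiffTalk) frame head generation as audio-conditioned denoising diffusion. As referenced in the DiffTalk entry, the sketch below illustrates a generic conditional denoising loop over per-frame motion coefficients; the noise schedule, network, and dimensions are hypothetical and do not reproduce either published method.

```python
# Minimal, illustrative sketch of audio-conditioned denoising diffusion over
# per-frame head-motion coefficients, in the spirit of DiffTalk / MFR-Net.
# The schedule, network, and dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

T = 50                                              # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)


class Denoiser(nn.Module):
    """Predicts the noise added to motion coefficients, given audio features
    and the diffusion timestep."""

    def __init__(self, coeff_dim: int = 64, audio_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coeff_dim + audio_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, coeff_dim),
        )

    def forward(self, x_t, audio_feat, t):
        # x_t, audio_feat: (batch, frames, dim); t: scalar timestep index
        t_emb = torch.full_like(x_t[..., :1], float(t) / T)
        return self.net(torch.cat([x_t, audio_feat, t_emb], dim=-1))


@torch.no_grad()
def sample(denoiser, audio_feat):
    """Reverse diffusion: start from noise, iteratively denoise conditioned on audio."""
    x = torch.randn(audio_feat.shape[0], audio_feat.shape[1], 64)
    for t in reversed(range(T)):
        eps = denoiser(x, audio_feat, t)
        # DDPM posterior mean (simplified; fixed variance, no learned schedule)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                        # (batch, frames, 64) coefficients


if __name__ == "__main__":
    audio = torch.randn(1, 100, 256)                # 100 frames of audio features
    coeffs = sample(Denoiser(), audio)
    print(coeffs.shape)                             # torch.Size([1, 100, 64])
```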