Laughing Matters: Introducing Laughing-Face Generation using Diffusion
Models
- URL: http://arxiv.org/abs/2305.08854v2
- Date: Wed, 30 Aug 2023 14:01:36 GMT
- Title: Laughing Matters: Introducing Laughing-Face Generation using Diffusion
Models
- Authors: Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos
Vougioukas, Stavros Petridis, Maja Pantic
- Abstract summary: We propose a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter.
We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter.
Our model achieves state-of-the-art performance across all metrics, even when prior speech-driven approaches are re-trained for laughter generation.
- Score: 35.688696422879175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-driven animation has gained significant traction in recent years, with
current methods achieving near-photorealistic results. However, the field
remains underexplored regarding non-verbal communication despite evidence
demonstrating its importance in human interaction. In particular, generating
laughter sequences presents a unique challenge due to the intricacy and nuances
of this behaviour. This paper aims to bridge this gap by proposing a novel
model capable of generating realistic laughter sequences, given a still
portrait and an audio clip containing laughter. We highlight the failure cases
of traditional facial animation methods and leverage recent advances in
diffusion models to produce convincing laughter videos. We train our model on a
diverse set of laughter datasets and introduce an evaluation metric
specifically designed for laughter. When compared with previous speech-driven
approaches, our model achieves state-of-the-art performance across all metrics,
even when these approaches are re-trained for laughter generation. Our code and
project are publicly available.
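To make the general recipe concrete, here is a minimal, hypothetical sketch of audio-conditioned diffusion sampling for portrait animation in PyTorch. The network, the shapes, and the names (ToyLaughterDenoiser, sample_laughter_video, per-frame audio features) are illustrative assumptions, not the authors' architecture, which additionally has to enforce temporal coherence across frames.

```python
# Hypothetical sketch: audio-conditioned DDPM sampling that animates a still portrait.
# Not the authors' code; names, shapes, and the per-frame denoising are assumptions.

import torch
import torch.nn as nn


class ToyLaughterDenoiser(nn.Module):
    """Predicts the noise added to a video frame, conditioned on the still
    portrait and an audio feature for that frame (hypothetical architecture)."""

    def __init__(self, img_ch=3, audio_dim=128, hidden=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.time_proj = nn.Linear(1, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(img_ch * 2 + hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1),
        )

    def forward(self, noisy_frame, portrait, audio_feat, t):
        # Broadcast the (audio + timestep) conditioning over spatial dimensions.
        b, _, h, w = noisy_frame.shape
        cond = self.audio_proj(audio_feat) + self.time_proj(t.view(b, 1).float())
        cond = cond.view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([noisy_frame, portrait, cond], dim=1))


@torch.no_grad()
def sample_laughter_video(model, portrait, audio_feats, steps=50):
    """DDPM-style reverse process, run independently per frame for brevity.
    portrait: (1, 3, H, W); audio_feats: (num_frames, audio_dim)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    frames = []
    for audio_feat in audio_feats:              # one frame per audio window
        x = torch.randn_like(portrait)          # start from pure noise
        for t in reversed(range(steps)):
            eps = model(x, portrait, audio_feat.unsqueeze(0), torch.tensor([t]))
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * eps) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        frames.append(x)
    return torch.stack(frames, dim=1)           # (1, num_frames, 3, H, W)


# Tiny smoke test with random tensors.
model = ToyLaughterDenoiser()
video = sample_laughter_video(model, torch.randn(1, 3, 32, 32), torch.randn(8, 128))
print(video.shape)  # torch.Size([1, 8, 3, 32, 32])
```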
Related papers
- EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions [18.364859748601887]
We propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach.
Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations.
arXiv Detail & Related papers (2024-02-27T13:10:11Z)
- Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like [49.2096391012794]
ELaTE is a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt.
We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS.
We show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models.
arXiv Detail & Related papers (2024-02-12T02:58:10Z)
- SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models [32.60274453610208]
We tackle a new challenge: enabling machines to understand the rationale behind laughter in video.
Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh.
arXiv Detail & Related papers (2023-12-15T14:17:45Z)
- LaughTalk: Expressive 3D Talking Head Generation with Laughter [15.60843963655039]
We introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter.
Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters.
Our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals.
arXiv Detail & Related papers (2023-11-02T05:04:33Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video just a few minutes long and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation [54.68893964373141]
Talking face generation has historically struggled to produce head movements and natural facial expressions without guidance from additional reference videos.
Recent developments in diffusion-based generative models allow for more realistic and stable data synthesis.
We present an autoregressive diffusion model that requires only one identity image and audio sequence to generate a video of a realistic talking human head.
arXiv Detail & Related papers (2023-01-06T14:16:54Z)
- Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild [8.242747994568212]
It is unclear how the perception and annotation of laughter differ when it is annotated from other modalities, such as video, via the body movements that accompany laughter.
We ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance.
Our analysis of more than 4000 annotations acquired from 48 annotators revealed evidence of incongruity in the perception of laughter and its intensity between modalities.
arXiv Detail & Related papers (2022-11-02T00:18:08Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion while also producing plausible animation of the parts of the face that are uncorrelated with the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above (including all summaries) and is not responsible for any consequences of its use.