ReactDiff: Latent Diffusion for Facial Reaction Generation
- URL: http://arxiv.org/abs/2505.14151v3
- Date: Wed, 04 Jun 2025 04:30:30 GMT
- Title: ReactDiff: Latent Diffusion for Facial Reaction Generation
- Authors: Jiaming Li, Sheng Wang, Xin Wang, Yitao Zhu, Honglin Xiong, Zixu Zhuang, Qian Wang
- Abstract summary: Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. We propose the Facial Reaction Diffusion (ReactDiff) framework that integrates a Multi-Modality Transformer with conditional diffusion. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094.
- Score: 15.490774894749277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced at https://github.com/Hunan-Tiger/ReactDiff.
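As a rough illustration of the approach described in the abstract, the sketch below pairs a multi-modal transformer that fuses the speaker's audio-visual features into a conditioning vector with a DDPM-style sampler that denoises a listener-reaction latent. All module names, dimensions, and the simple noise schedule are illustrative assumptions rather than the released ReactDiff code; see the linked repository for the actual implementation.

```python
# Minimal sketch of conditional latent diffusion for reaction generation,
# in the spirit of ReactDiff. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM, STEPS = 128, 256, 50


class MultiModalCondEncoder(nn.Module):
    """Fuses speaker audio and video features into one conditioning vector."""
    def __init__(self, audio_dim=64, video_dim=64, cond_dim=COND_DIM):
        super().__init__()
        self.proj = nn.Linear(audio_dim + video_dim, cond_dim)
        layer = nn.TransformerEncoderLayer(d_model=cond_dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio, video):            # (B, T, audio_dim), (B, T, video_dim)
        tokens = self.proj(torch.cat([audio, video], dim=-1))
        return self.fuse(tokens).mean(dim=1)    # (B, cond_dim) pooled condition


class LatentDenoiser(nn.Module):
    """Predicts the noise added to a reaction latent, given timestep and condition."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, z_t, t, cond):
        t_feat = t.float().unsqueeze(-1) / STEPS      # crude timestep embedding
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))


betas = torch.linspace(1e-4, 0.02, STEPS)             # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)


def sample_reaction_latent(denoiser, cond):
    """DDPM-style ancestral sampling of a listener-reaction latent."""
    z = torch.randn(cond.size(0), LATENT_DIM)
    for t in reversed(range(STEPS)):
        t_batch = torch.full((cond.size(0),), t)
        eps = denoiser(z, t_batch, cond)
        alpha, a_bar = 1.0 - betas[t], alphas_bar[t]
        z = (z - betas[t] / torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(alpha)
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z                                           # decode with the reaction decoder


if __name__ == "__main__":
    cond_enc, denoiser = MultiModalCondEncoder(), LatentDenoiser()
    audio, video = torch.randn(2, 30, 64), torch.randn(2, 30, 64)
    latent = sample_reaction_latent(denoiser, cond_enc(audio, video))
    print(latent.shape)  # torch.Size([2, 128])
```

Because sampling starts from fresh Gaussian noise each time, repeated calls with the same speaker clip yield different latents, which is how a latent-diffusion design trades off diversity against contextual appropriateness.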
Related papers
- FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units [40.86547778808649]
We propose a novel framework called FauForensics for detecting audio-visual deepfakes. Our method computes fine-grained frame-wise audio-visual similarities via a dedicated fusion module. Experiments on FakeAVCeleb and LAV-DF show state-of-the-art (SOTA) performance and superior cross-dataset generalizability, by up to 4.83% on average.
arXiv Detail & Related papers (2025-05-13T07:18:07Z) - Latent Behavior Diffusion for Sequential Reaction Generation in Dyadic Setting [11.016004057765185]
The dyadic reaction generation task involves responsive facial reactions that align closely with the behaviors of a conversational partner. This paper introduces a novel approach, the Latent Behavior Diffusion Model, comprising a context-aware autoencoder and a diffusion-based conditional generator. Experimental results demonstrate the effectiveness of our approach in achieving superior performance in dyadic reaction synthesis tasks compared to existing methods.
arXiv Detail & Related papers (2025-05-12T09:22:27Z) - Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z) - ReactFace: Online Multiple Appropriate Facial Reaction Generation in Dyadic Interactions [46.66378299720377]
In dyadic interaction, predicting the listener's facial reactions is challenging as different reactions could be appropriate in response to the same speaker's behaviour.
This paper reformulates the task as an extrapolation or prediction problem, and proposes a novel framework (called ReactFace) to generate multiple different but appropriate facial reactions.
arXiv Detail & Related papers (2023-05-25T05:55:53Z) - Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation [22.579200870471475]
This paper proposes the first multiple appropriate facial reaction generation framework.
It re-formulates the one-to-many mapping facial reaction generation problem as a one-to-one mapping problem.
Experimental results demonstrate that our approach outperforms existing models in generating more appropriate, realistic, and synchronized facial reactions.
arXiv Detail & Related papers (2023-05-24T15:56:26Z) - DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z) - Interaction Mix and Match: Synthesizing Close Interaction using Conditional Hierarchical GAN with Multi-Hot Class Embedding [4.864897201841002]
We propose a novel way to create realistic human reactive motions by mixing and matching different types of close interactions.
Experiments are conducted on both noisy (depth-based) and high-quality (versa-based) interaction datasets.
arXiv Detail & Related papers (2022-07-23T16:13:10Z) - Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR).
We study the interaction paradigm in depth, where we find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)