RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
- URL: http://arxiv.org/abs/2601.10606v1
- Date: Thu, 15 Jan 2026 17:23:19 GMT
- Title: RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
- Authors: Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian
- Abstract summary: We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal.
- Score: 16.484330085082536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
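The abstract states that 3D Gaussians are bound to mesh facets but does not say how. The following is a minimal sketch of one common way such a binding can work, not the authors' implementation: each Gaussian stores an offset in a triangle-local frame, so it follows the facet when speech-driven motion deforms the mesh. The function names (`triangle_frame`, `gaussian_world_center`) and the choice of frame and size normalization are assumptions made for illustration.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def add(a, b):
    return [a[i] + b[i] for i in range(3)]

def scale(a, s):
    return [a[i] * s for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def normalize(a):
    return scale(a, 1.0 / norm(a))

def triangle_frame(v0, v1, v2):
    """Return (origin, axes, size) of a local frame for one mesh facet.

    Assumed convention: origin is the centroid, the first axis follows
    edge v0->v1, the third axis is the facet normal, and size is the
    mean length of the two edges leaving v0.
    """
    origin = [(v0[i] + v1[i] + v2[i]) / 3.0 for i in range(3)]
    edge_a, edge_b = sub(v1, v0), sub(v2, v0)
    e1 = normalize(edge_a)
    n = normalize(cross(edge_a, edge_b))
    e2 = cross(n, e1)  # completes a right-handed orthonormal frame
    size = 0.5 * (norm(edge_a) + norm(edge_b))
    return origin, (e1, e2, n), size

def gaussian_world_center(local_offset, v0, v1, v2):
    """Map a Gaussian's facet-local offset to world space."""
    origin, axes, size = triangle_frame(v0, v1, v2)
    world = origin
    for coeff, axis in zip(local_offset, axes):
        world = add(world, scale(axis, coeff * size))
    return world

# The same local offset evaluated on a rest-pose facet and on a moved facet.
rest = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
moved = tuple(add(v, [0.0, 0.0, 2.0]) for v in rest)
center_rest = gaussian_world_center([0.0, 0.0, 0.5], *rest)
center_moved = gaussian_world_center([0.0, 0.0, 0.5], *moved)
```

Because the offset is stored relative to the facet, only the mesh vertices need to be animated; the Gaussian centers are recomputed from them each frame. A full system would also transform each Gaussian's rotation and scale, which this sketch omits.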
Related papers
- Splat-Portrait: Generalizing Talking Heads with Gaussian Splatting [6.62155043692653]
Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specifics such as warping-based facial motion representation priors to animate talking motions. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis.
arXiv Detail & Related papers (2026-01-26T16:06:57Z)
- Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics [40.86039227407712]
We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set.
arXiv Detail & Related papers (2025-12-17T11:37:35Z)
- GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
- Learn2Talk: 3D Talking Face Learns from 2D Talking Face [15.99315075587735]
We propose a learning framework named Learn2Talk, which can construct a better 3D talking face network.
Inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync.
A teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network.
arXiv Detail & Related papers (2024-04-19T13:45:14Z)
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis [88.17520303867099]
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio.
We present Real3D-Portrait, a framework that improves the one-shot 3D reconstruction power with a large image-to-plane model.
Experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos.
arXiv Detail & Related papers (2024-01-16T17:04:30Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Generative Proxemics: A Prior for 3D Social Interaction from Images [32.547187575678464]
Social interaction is a fundamental aspect of human behavior and communication.
We present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction.
Our approach recovers accurate and plausible 3D social interactions from noisy initial estimates, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2023-06-15T17:59:20Z)
- DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focused on single-person talking head generation.
We propose a novel unified framework based on neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z)
- EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices [76.50816193153098]
EgoBody is a novel large-scale dataset for social interactions in complex 3D scenes.
We employ Microsoft HoloLens2 headsets to record rich egocentric data streams including RGB, depth, eye gaze, head and hand tracking.
To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames.
arXiv Detail & Related papers (2021-12-14T18:41:28Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)