EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
- URL: http://arxiv.org/abs/2510.08587v1
- Date: Fri, 03 Oct 2025 14:31:20 GMT
- Title: EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
- Authors: Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
- Abstract summary: EGSTalker is a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). It requires only 3-5 minutes of training video to synthesize high-quality facial animations. EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed.
- Score: 37.390794417927644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker's potential for real-time multimedia applications.
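The abstract describes a two-stage pipeline: spatial features sampled from a multi-resolution hash triplane build a static Gaussian representation, then an Efficient Spatial-Audio Attention (ESAA) module fuses audio with spatial cues and a KAN predicts per-Gaussian deformations. Below is a minimal, hypothetical sketch of that second stage, not the authors' code: the module names, feature dimensions, and the plain MLP standing in for the paper's KAN are all assumptions for illustration only.

```python
# Hypothetical sketch of audio-driven Gaussian deformation (not the authors' implementation).
# A cross-attention block fuses per-Gaussian spatial features with audio features,
# and a small MLP (standing in for the paper's KAN) predicts per-Gaussian offsets.
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    """Per-Gaussian spatial features attend to audio features (ESAA-like fusion, assumed design)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial_feat, audio_feat):
        # spatial_feat: (B, N_gaussians, dim), audio_feat: (B, T_audio, dim)
        fused, _ = self.attn(query=spatial_feat, key=audio_feat, value=audio_feat)
        return fused + spatial_feat  # residual connection keeps the static representation as a prior

class DeformationHead(nn.Module):
    """Predicts per-Gaussian position, rotation, and scale offsets from fused features."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3 + 4 + 3))

    def forward(self, fused_feat):
        out = self.mlp(fused_feat)
        d_xyz, d_rot, d_scale = out.split([3, 4, 3], dim=-1)
        return d_xyz, d_rot, d_scale

# Usage with dummy tensors (dimensions are illustrative)
B, N, T, D = 1, 1024, 16, 64
spatial = torch.randn(B, N, D)   # features sampled from a hash-triplane-like encoder (assumed)
audio = torch.randn(B, T, D)     # per-frame audio embeddings (assumed)
fusion = SpatialAudioAttention(D)
head = DeformationHead(D)
d_xyz, d_rot, d_scale = head(fusion(spatial, audio))
```

The predicted offsets would then be applied to the static Gaussians before rasterization; how EGSTalker parameterizes and regularizes these deformations is detailed in the paper itself.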
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control [37.390794417927644]
We present PGSTalker, a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting (3DGS). To improve rendering performance, we propose a pixel-aware density control strategy that adaptively allocates point density, enhancing detail in dynamic facial regions while reducing redundancy elsewhere.
arXiv Detail & Related papers (2025-09-21T05:01:54Z) - D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis [28.923949756720425]
A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-trained models. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and facial motion signals.
arXiv Detail & Related papers (2025-08-20T06:12:33Z) - Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis [56.749927786910554]
We propose a novel framework that integrates Gaussian Splatting with a structured Audio Factorization Plane (Audio-Plane) to enable high-quality, audio-synchronized, and real-time talking head generation. Our method achieves state-of-the-art visual quality, precise audio-lip synchronization, and real-time performance, outperforming prior approaches across both 2D- and 3D-based paradigms.
arXiv Detail & Related papers (2025-03-28T16:50:27Z) - MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo [54.00987996368157]
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS).
MVSGaussian achieves real-time rendering with better synthesis quality for each scene.
arXiv Detail & Related papers (2024-05-20T17:59:30Z) - GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting [57.59261043916292]
GSTalker is a 3D audio-driven talking face generation model with Gaussian Splatting.
It can generate high-fidelity and audio-lip synchronized results with fast training and real-time rendering speed.
arXiv Detail & Related papers (2024-04-29T18:28:36Z) - GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting [25.78134656333095]
We propose a novel framework for real-time generation of pose-controllable talking heads.
GaussianTalker builds a canonical 3DGS representation of the head and deforms it in sync with the audio.
It exploits the spatial-aware features and enforces interactions between neighboring points.
arXiv Detail & Related papers (2024-04-24T17:45:24Z) - GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting [27.699313086744237]
GaussianTalker is a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting.
The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction.
The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose.
arXiv Detail & Related papers (2024-04-22T09:51:43Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.