EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
- URL: http://arxiv.org/abs/2510.08587v1
- Date: Fri, 03 Oct 2025 14:31:20 GMT
- Title: EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
- Authors: Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
- Abstract summary: EGSTalker is a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). It requires only 3-5 minutes of training video to synthesize high-quality facial animations. EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed.
- Score: 37.390794417927644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker's potential for real-time multimedia applications.
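The abstract describes a two-stage pipeline: spatial features sampled from a multi-resolution hash triplane build a static Gaussian representation, then an Efficient Spatial-Audio Attention (ESAA) module fuses audio with spatial cues and a KAN predicts per-Gaussian deformations. Below is a minimal, hypothetical sketch of that second stage, not the authors' code: the module names, feature dimensions, and the plain MLP standing in for the paper's KAN are all assumptions for illustration only.

```python
# Hypothetical sketch of audio-driven Gaussian deformation (not the authors' implementation).
# A cross-attention block fuses per-Gaussian spatial features with audio features,
# and a small MLP (standing in for the paper's KAN) predicts per-Gaussian offsets.
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    """Per-Gaussian spatial features attend to audio features (ESAA-like fusion, assumed design)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial_feat, audio_feat):
        # spatial_feat: (B, N_gaussians, dim), audio_feat: (B, T_audio, dim)
        fused, _ = self.attn(query=spatial_feat, key=audio_feat, value=audio_feat)
        return fused + spatial_feat  # residual connection keeps the static representation as a prior

class DeformationHead(nn.Module):
    """Predicts per-Gaussian position, rotation, and scale offsets from fused features."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3 + 4 + 3))

    def forward(self, fused_feat):
        out = self.mlp(fused_feat)
        d_xyz, d_rot, d_scale = out.split([3, 4, 3], dim=-1)
        return d_xyz, d_rot, d_scale

# Usage with dummy tensors (dimensions are illustrative)
B, N, T, D = 1, 1024, 16, 64
spatial = torch.randn(B, N, D)   # features sampled from a hash-triplane-like encoder (assumed)
audio = torch.randn(B, T, D)     # per-frame audio embeddings (assumed)
fusion = SpatialAudioAttention(D)
head = DeformationHead(D)
d_xyz, d_rot, d_scale = head(fusion(spatial, audio))
```

The predicted offsets would then be applied to the static Gaussians before rasterization; how EGSTalker parameterizes and regularizes these deformations is detailed in the paper itself.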
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control [37.390794417927644]
We present PGSTalker, a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting (3DGS). To improve rendering performance, we propose a pixel-aware density control strategy that adaptively allocates point density, enhancing detail in dynamic facial regions while reducing redundancy elsewhere.
arXiv Detail & Related papers (2025-09-21T05:01:54Z) - D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis [28.923949756720425]
A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-trained models. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and facial motion signals.
arXiv Detail & Related papers (2025-08-20T06:12:33Z) - Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis [56.749927786910554]
We propose a novel framework that integrates Gaussian Splatting with a structured Audio Factorization Plane (Audio-Plane) to enable high-quality, audio-synchronized, and real-time talking head generation. Our method achieves state-of-the-art visual quality, precise audio-lip synchronization, and real-time performance, outperforming prior approaches across both 2D- and 3D-based paradigms.
arXiv Detail & Related papers (2025-03-28T16:50:27Z) - MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo [54.00987996368157]
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS).
MVSGaussian achieves real-time rendering with better synthesis quality for each scene.
arXiv Detail & Related papers (2024-05-20T17:59:30Z) - GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting [57.59261043916292]
GSTalker is a 3D audio-driven talking face generation model with Gaussian Splatting.
It can generate high-fidelity and audio-lip synchronized results with fast training and real-time rendering speed.
arXiv Detail & Related papers (2024-04-29T18:28:36Z) - GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting [25.78134656333095]
We propose a novel framework for real-time generation of pose-controllable talking heads.
GaussianTalker builds a canonical 3DGS representation of the head and deforms it in sync with the audio.
It exploits the spatial-aware features and enforces interactions between neighboring points.
arXiv Detail & Related papers (2024-04-24T17:45:24Z) - GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting [27.699313086744237]
GaussianTalker is a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting.
The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction.
The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose.
arXiv Detail & Related papers (2024-04-22T09:51:43Z) - SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z) - End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.