Related papers: SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

URL: http://arxiv.org/abs/2503.13371v1
Date: Mon, 17 Mar 2025 16:58:53 GMT
Title: SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization
Authors: Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson,
Abstract summary: Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks.<n>Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task.<n>We propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT.
Score: 32.220603714163026
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but experiencing lower synchronization compared to their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3 for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

Related papers

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.<n>Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation.<n>We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation [21.216297567167036]
MirrorMe is a real-time, controllable framework built on the LTX video model.<n>MirrorMe compresses video spatially and temporally for efficient latent space denoising.<n> experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.
arXiv Detail & Related papers (2025-06-27T09:57:23Z)
AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans.<n>We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis.<n>AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the inference steps for accelerating video diffusion models with synthetic dataset. Our model achieves 8.5x improvements in generation speed compared to the teacher model. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z)
StochSync: Stochastic Diffusion Synchronization for Image Generation in Arbitrary Spaces [11.517082612850443]
We propose a method for generating images in arbitrary spaces using a pretrained image diffusion model.<n>The zero-shot method combines the strengths of both image conditioning and 3D mesh-based methods.
arXiv Detail & Related papers (2025-01-26T08:22:44Z)
LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision [13.447234794835921]
We introduce StableSyncNet, with an architecture designed for stable convergence.<n>Our method surpasses state-of-the-art lip-sync approaches across various evaluation metrics on the HDTF and VoxCeleb2 datasets.
arXiv Detail & Related papers (2024-12-12T13:20:52Z)
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN. We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples extraction from synchronization modelling. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge [63.00793292863]
ToddlerDiffusion is a novel approach to decomposing the complex task of RGB image generation into simpler, interpretable stages. Our method, termed ToddlerDiffusion, cascades modality-specific models, each responsible for generating an intermediate representation. ToddlerDiffusion consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-24T15:20:01Z)
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach. It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps. We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches. Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, on progressive fast diffusion model for high-quality text-to-speech. ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms. ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z)
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the audio into a common part. We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both object and subject evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.