Related papers: MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation

URL: http://arxiv.org/abs/2403.19144v1
Date: Thu, 28 Mar 2024 04:35:42 GMT
Title: MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation
Authors: Seyeon Kim, Siyoon Jin, Jihye Park, Kihong Kim, Jiyoung Kim, Jisu Nam, Seungryong Kim,
Abstract summary: We propose a novel motion-disentangled diffusion model for talking head generation, dubbed MoDiTalker. We introduce the two modules: audio-to-motion (AToM), designed to generate a synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models.
Score: 29.620451579580763
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conventional GAN-based models for talking head generation often suffer from limited quality and unstable training. Recent approaches based on diffusion models aimed to address these limitations and improve fidelity. However, they still face challenges, including extensive sampling times and difficulties in maintaining temporal consistency due to the high stochasticity of diffusion models. To overcome these challenges, we propose a novel motion-disentangled diffusion model for high-quality talking head generation, dubbed MoDiTalker. We introduce the two modules: audio-to-motion (AToM), designed to generate a synchronized lip motion from audio, and motion-to-video (MToV), designed to produce high-quality head video following the generated motion. AToM excels in capturing subtle lip movements by leveraging an audio attention mechanism. In addition, MToV enhances temporal consistency by leveraging an efficient tri-plane representation. Our experiments conducted on standard benchmarks demonstrate that our model achieves superior performance compared to existing models. We also provide comprehensive ablation studies and user study results.

Related papers

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.<n>Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation.<n>We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation [57.33788820909211]
We propose a parameter-efficient textbfDual-Expert Consistency Model(DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement.<n>Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
arXiv Detail & Related papers (2025-06-03T17:55:04Z)
FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models.<n>It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling.
arXiv Detail & Related papers (2025-06-01T19:55:33Z)
One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion [63.609399000712905]
Inference at a scaled resolution leads to repetitive patterns and structural distortions. We propose two simple modules that combine to solve these issues. Our method, coined Fam diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training.
arXiv Detail & Related papers (2024-11-27T17:51:44Z)
Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step. Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z)
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation. A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens. An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
High-Resolution Speech Restoration with Latent Diffusion Model [24.407232363131534]
Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics. We propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details.
arXiv Detail & Related papers (2024-09-17T12:55:23Z)
Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation [22.159117464397806]
We introduce a two-stage diffusion-based model for talking head generation. The first stage involves generating synchronized facial landmarks based on the given speech. In the second stage, these generated landmarks serve as a condition in the denoising process, aiming to optimize mouth jitter issues and generate high-fidelity, well-synchronized, and temporally coherent talking head videos.
arXiv Detail & Related papers (2024-08-03T10:19:38Z)
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling. It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences. It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
arXiv Detail & Related papers (2023-12-08T23:55:19Z)
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA) Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models. To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE) We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
Autoregressive GAN for Semantic Unconditional Head Motion Generation [0.0]
We devise a GAN-based architecture that learns to synthesize rich head motion sequences over long duration while maintaining low error accumulation levels. We experimentally demonstrate the relevance of the proposed method and show its superiority compared to models that attained state-of-the-art performances on similar tasks.
arXiv Detail & Related papers (2022-11-02T09:48:49Z)
Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.