FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- URL: http://arxiv.org/abs/2412.01064v2
- Date: Wed, 04 Dec 2024 09:43:18 GMT
- Title: FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
- Authors: Taekyung Ki, Dongchan Min, Gyeongsu Chae
- Abstract summary: FLOAT is an audio-driven talking portrait video generation method based on a flow matching generative model.
We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion.
Our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions.
- Score: 3.3672851080270374
- License:
- Abstract: With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.
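To make the abstract's core idea more concrete, the sketch below illustrates conditional flow matching over per-frame motion latents with a transformer velocity (vector field) predictor that receives audio features frame by frame. This is a minimal conceptual sketch, not the FLOAT implementation: the module names, dimensions, linear probability path, and Euler sampler are illustrative assumptions.

```python
# Minimal conceptual sketch of conditional flow matching in a learned motion
# latent space. Illustrative only; names, dimensions, and the frame-wise
# conditioning layout are assumptions, not the paper's code.
import torch
import torch.nn as nn

class MotionVectorField(nn.Module):
    """Transformer that predicts a velocity field over motion latents,
    conditioned frame-by-frame on audio features (hypothetical layout)."""
    def __init__(self, motion_dim=128, audio_dim=768, d_model=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + audio_dim + 1, d_model)  # +1 for time t
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out_proj = nn.Linear(d_model, motion_dim)

    def forward(self, x_t, audio, t):
        # x_t: (B, T, motion_dim), audio: (B, T, audio_dim), t: (B,)
        t_tok = t[:, None, None].expand(-1, x_t.size(1), 1)          # broadcast t to each frame
        h = self.in_proj(torch.cat([x_t, audio, t_tok], dim=-1))     # frame-wise conditioning
        return self.out_proj(self.backbone(h))

def flow_matching_loss(model, x1, audio):
    """One training step of flow matching on ground-truth motion latents x1."""
    x0 = torch.randn_like(x1)                                # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)             # t ~ U(0, 1)
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1  # linear path between endpoints
    target_v = x1 - x0                                       # constant target velocity
    pred_v = model(x_t, audio, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_motion(model, audio, motion_dim=128, steps=10):
    """Few-step Euler integration of the learned ODE from noise to motion latents."""
    B, T = audio.shape[:2]
    x = torch.randn(B, T, motion_dim, device=audio.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=audio.device)
        x = x + dt * model(x, audio, t)
    return x
```

Under these assumptions, sampling integrates the learned vector field with only a handful of Euler steps over compact motion latents rather than pixel-based latents, which reflects the efficiency argument the abstract makes against iterative diffusion sampling in image latent space.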
Related papers
- Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models [3.052019331122618]
We propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon.
Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics.
Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios.
arXiv Detail & Related papers (2025-01-22T03:01:54Z) - Real-time One-Step Diffusion-based Expressive Portrait Videos Generation [85.07446744308247]
We introduce OSA-LCM (One-Step Avatar Latent Consistency Model), paving the way for real-time diffusion-based avatars.
Our method achieves comparable video quality to existing methods but requires only one sampling step, making it more than 10x faster.
arXiv Detail & Related papers (2024-12-18T03:42:42Z) - Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z) - Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method [60.88467353578118]
We show that a fixed-point-inspired iterative approach to invert real-world images does not achieve convergence, instead oscillating between distinct clusters.
We introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing.
arXiv Detail & Related papers (2024-11-17T17:45:37Z) - Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency [15.841490425454344]
We propose an end-to-end audio-only conditioned video diffusion model named Loopy.
Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
arXiv Detail & Related papers (2024-09-04T11:55:14Z) - Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations.
Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module.
The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z) - DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video [18.424138608823267]
We propose DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views from a monocular video affected by motion blur.
To account for motion blur in input images, we simultaneously capture the camera trajectory and object Discrete Cosine Transform (DCT) trajectories within the scene.
arXiv Detail & Related papers (2024-03-15T08:48:37Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z) - Decouple Content and Motion for Conditional Image-to-Video Generation [6.634105805557556]
The goal of conditional image-to-video (cI2V) generation is to create a believable new video starting from the given conditions, i.e., an image and text.
Previous cI2V generation methods conventionally operate in RGB pixel space, which limits their ability to model motion consistency and visual continuity.
We propose a novel approach by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions.
arXiv Detail & Related papers (2023-11-24T06:08:27Z) - Motion and Context-Aware Audio-Visual Conditioned Video Prediction [58.9467115916639]
We decouple the audio-visual conditioned video prediction into motion and appearance modeling.
The multimodal motion estimation predicts future optical flow based on the audio-motion correlation.
We propose context-aware refinement to address the loss of global appearance context.
arXiv Detail & Related papers (2022-12-09T05:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.