Wan-S2V: Audio-Driven Cinematic Video Generation
- URL: http://arxiv.org/abs/2508.18621v1
- Date: Tue, 26 Aug 2025 02:51:31 GMT
- Title: Wan-S2V: Audio-Driven Cinematic Video Generation
- Authors: Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo
- Abstract summary: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios involving speech and singing. We propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan.
- Score: 53.59046547320306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.
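The abstract does not spell out how audio conditioning is attached to the Wan backbone. As a rough illustration of one common design in audio-driven video diffusion models, the sketch below cross-attends per-frame audio features into a diffusion-transformer block over video latents; all module names, dimensions, and the overall wiring are assumptions for illustration, not the Wan-S2V implementation.

```python
# Illustrative sketch only: not the Wan-S2V architecture, which this abstract
# does not describe. Shows one typical way audio features are injected into a
# diffusion-transformer video backbone via cross-attention.
import torch
import torch.nn as nn


class AudioConditionedBlock(nn.Module):
    """Hypothetical DiT-style block: self-attention over video tokens,
    cross-attention into audio features, then an MLP (illustration only)."""

    def __init__(self, dim: int = 1024, audio_dim: int = 768, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                                vdim=audio_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_feats):
        # video_tokens: (B, N_video, dim) latent video patches;
        # audio_feats:  (B, N_audio, audio_dim), e.g. wav2vec-style features.
        q = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(q, q, q)[0]
        video_tokens = video_tokens + self.cross_attn(self.norm2(video_tokens),
                                                      audio_feats, audio_feats)[0]
        video_tokens = video_tokens + self.mlp(self.norm3(video_tokens))
        return video_tokens


# Toy usage: 2 clips, 256 video latent tokens, 50 audio frames.
block = AudioConditionedBlock()
out = block(torch.randn(2, 256, 1024), torch.randn(2, 50, 768))
print(out.shape)  # torch.Size([2, 256, 1024])
```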
Related papers
- Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback [9.569613635896026]
We propose a diffusion transformer (DiT)-based framework for generating talking videos of arbitrary length. We also introduce a training-free method for multi-character audio-driven animation. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-14T02:50:05Z) - X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio [27.619816538121327]
X-Actor generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics. X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations.
arXiv Detail & Related papers (2025-08-04T22:57:01Z) - EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation [8.214084596349744]
EchoMimicV3 is an efficient framework that unifies multi-task and multi-modal human animation. With a minimal model size of 1.3 billion parameters, EchoMimicV3 achieves competitive performance in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-07-05T05:36:26Z) - HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters [14.594698765723756]
HunyuanVideo-Avatar is a model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. A character image injection module is designed to replace the conventional addition-based character conditioning scheme. An Audio Emotion Module (AEM) is introduced to extract and transfer the emotional cues from an emotion reference image to the target generated video. A Face-Aware Audio Adapter (FAA) is proposed to isolate the audio-driven character with a latent-level face mask.
arXiv Detail & Related papers (2025-05-26T15:57:27Z) - Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing [60.38045088180188]
We propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. We incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies. Our method performs favorably against the state-of-the-art models on two primary benchmarks.
arXiv Detail & Related papers (2025-03-15T08:25:57Z) - EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. (A minimal sketch of this two-stage audio-to-motion-to-video pattern appears after this list.)
arXiv Detail & Related papers (2025-01-18T07:51:29Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation [50.66658181705527]
We present DAWN, a framework that enables all-at-once generation of dynamic-length video sequences. DAWN consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements.
arXiv Detail & Related papers (2024-10-17T16:32:36Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis [56.550999933048075]
We propose a video-based synthesis method that tackles these challenges and demonstrates high-quality results for in-the-wild videos.
We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes.
We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-11-10T20:18:57Z)
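Several of the papers above (notably EMO2 and DAWN) describe a two-stage pattern: audio is first mapped to an intermediate motion or pose representation, which then conditions a generator that renders the video frames. The sketch below is only a schematic illustration of that pattern under assumed interfaces; every class, dimension, and function name is a hypothetical placeholder, and the second stage is a stub standing in for a full diffusion-based renderer.

```python
# Schematic sketch of the audio -> motion -> video pattern; not code from any
# of the cited papers. All names, shapes, and modules are assumptions.
import torch
import torch.nn as nn


class AudioToMotion(nn.Module):
    """Stage 1 (hypothetical): regress a per-frame motion/pose latent from audio."""

    def __init__(self, audio_dim: int = 768, motion_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(audio_dim, motion_dim, batch_first=True)

    def forward(self, audio_feats):
        motion, _ = self.gru(audio_feats)   # (B, T, motion_dim)
        return motion


class MotionToVideo(nn.Module):
    """Stage 2 stand-in: a real system would run a diffusion sampler conditioned
    on the motion codes and a reference image; this stub only projects the motion
    codes to per-frame latents so the pipeline runs end to end."""

    def __init__(self, motion_dim: int = 128, latent_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(motion_dim, latent_dim)

    def forward(self, motion, reference_image):
        del reference_image                 # unused in this stub
        return self.proj(motion)            # (B, T, latent_dim) frame latents


# Toy usage: 1 clip, 100 audio frames, one 256x256 reference image.
audio = torch.randn(1, 100, 768)
ref = torch.randn(1, 3, 256, 256)
frames = MotionToVideo()(AudioToMotion()(audio), ref)
print(frames.shape)  # torch.Size([1, 100, 64])
```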