Related papers: Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

URL: http://arxiv.org/abs/2509.09595v2
Date: Wed, 17 Sep 2025 04:08:08 GMT
Title: Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
Authors: Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan,
Abstract summary: We introduce Kling-Avatar, a novel framework that unifies multimodal instruction understanding with portrait generation.<n>Our method is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps.<n>These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio synthesis.
Score: 48.47254451688591
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.

Related papers

JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning [18.72712280434528]
JoyAvatar is a framework capable of generating long duration avatar videos.<n>We introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability.<n>During training, we dynamically modulate the strength of multi-modal conditions.
arXiv Detail & Related papers (2026-01-31T13:00:57Z)
Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories [5.022547031373416]
We introduce a method that approaches video generation in a filmmaker-like manner.<n>Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script.<n>This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually.
arXiv Detail & Related papers (2025-12-17T18:10:27Z)
KlingAvatar 2.0 Technical Report [43.949604396366425]
Our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation.<n>It delivers enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
arXiv Detail & Related papers (2025-12-15T13:30:51Z)
TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation [76.48551690189406]
We present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation.<n>TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views.<n>The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation.
arXiv Detail & Related papers (2025-10-08T17:16:09Z)
Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency.<n>Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time.<n>We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z)
AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective [15.69417162113696]
AvatarSync is an autoregressive framework on phoneme representations that generates realistic talking-head animations from a single reference image.<n>We show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency.
arXiv Detail & Related papers (2025-09-15T15:34:02Z)
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation [29.41106195298283]
Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character's authentic essence.<n>textbfwe propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive.
arXiv Detail & Related papers (2025-08-26T17:15:26Z)
InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories.<n>We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing.<n> Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z)
Video Perception Models for 3D Scene Synthesis [109.5543506037003]
VIPScene is a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models.<n>VIPScene seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
arXiv Detail & Related papers (2025-06-25T16:40:17Z)
Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer [25.39030226963548]
We introduce the first application of a pretrained transformer-based video generative model for portrait animation.<n>Our method is validated through experiments on benchmark and newly proposed wild datasets.
arXiv Detail & Related papers (2024-12-01T08:54:30Z)
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering. We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.