Related papers: MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

URL: http://arxiv.org/abs/2508.19320v2
Date: Thu, 28 Aug 2025 09:15:43 GMT
Title: MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
Authors: Ming Chen, Liyuan Cui, Wenyuan Zhang, Haoxian Zhang, Yan Zhou, Xiaohan Li, Songlin Tang, Jiwen Liu, Borui Liao, Hejia Chen, Xiaoqiang Liu, Pengfei Wan
Abstract summary: We introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. Our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources.
Score: 23.343080324521434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building such a practical system that can interact with diverse input signals in real time remains challenging to existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64$\times$ reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.

Related papers

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation [35.01134463094784]
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. Existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap.
arXiv Detail & Related papers (2025-12-29T16:17:36Z)
MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation [59.23161833385837]
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
arXiv Detail & Related papers (2025-12-02T18:55:53Z)
X-Streamer: Unified Human World Modeling with Audiovisual Interaction [36.50697656708077]
X-Streamer is a framework for building digital human agents capable of infinite interactions across text, speech, and video. At its core is a Thinker-Actor dual-transformer architecture that unifies multimodal understanding and generation. X-Streamer runs in real time on two A100 GPUs, sustaining hours-long consistent video chat experiences.
arXiv Detail & Related papers (2025-09-25T20:53:27Z)
LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [65.53676584955686]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z)
Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance [36.99310116405025]
Long-duration synthesis faces persistent challenges in simultaneously achieving high quality, portrait and temporal consistency, and computational efficiency. Here, we present LetsTalk, a transformer diffusion framework that incorporates multimodal guidance and a novel memory bank mechanism. Experiments demonstrate that LetsTalk achieves temporally coherent and realistic talking videos with enhanced diversity and liveliness, while maintaining remarkable efficiency with 8$\times$ fewer parameters than previous approaches.
arXiv Detail & Related papers (2024-11-24T04:46:00Z)
Large Body Language Models [1.9797215742507548]
We introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Fréchet Gesture Distance (FGD) and a 25% improvement in Fréchet Inception Distance compared to existing approaches.
arXiv Detail & Related papers (2024-10-21T21:48:24Z)
Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation [16.033455552126348]
We propose a multi-scale audio-visual synchrony loss and a multi-scale autoregressive GAN. We train a stack of syncer models on multimodal input pyramids and use these models as guidance in a multi-scale generator network. Experiments show significant improvements over the state-of-the-art in head motion dynamics quality.
arXiv Detail & Related papers (2023-07-04T08:29:59Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.