MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
- URL: http://arxiv.org/abs/2512.03034v1
- Date: Tue, 02 Dec 2025 18:55:53 GMT
- Title: MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
- Authors: Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu
- Abstract summary: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
- Score: 59.23161833385837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
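The Conductor-Creator split described in the abstract can be pictured as a two-stage pipeline: the Conductor decomposes a user query into motion and speech instructions, and the Creator renders them with separate AR (audio) and diffusion (video) models. The sketch below is a minimal illustration of that data flow; all class and method names are hypothetical assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the Conductor-Creator architecture from the abstract.
# Names (Instruction, Conductor.plan, Creator.respond) are illustrative only.
from dataclasses import dataclass


@dataclass
class Instruction:
    """Fine-grained instruction, decomposed into motion and speech components."""
    motion: str  # e.g. a gesture or expression directive
    speech: str  # the verbal response to synthesize


class Conductor:
    """Understands and reasons over the multimodal query, then emits instructions."""

    def plan(self, user_audio: bytes, user_video: bytes) -> Instruction:
        # In the paper this stage performs multimodal understanding and reasoning;
        # here a fixed instruction stands in for that process.
        return Instruction(motion="smile and nod", speech="Hello! How can I help?")


class Creator:
    """Renders the response: an AR model for audio, a diffusion model for video."""

    def respond(self, inst: Instruction) -> tuple[str, str]:
        # Placeholder strings stand in for the AR audio and diffusion video outputs.
        audio = f"AR speech for: {inst.speech!r}"
        video = f"diffusion clip for: {inst.motion!r}"
        return audio, video


conductor, creator = Conductor(), Creator()
audio, video = creator.respond(conductor.plan(b"", b""))
print(audio)
print(video)
```

The point of the split is that control stays interpretable: the Conductor's instructions are explicit motion/speech pairs, so each clip's generation can be steered at a fine granularity before the heavier generative models run.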
Related papers
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue. It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z) - Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions. It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z) - ChatUMM: Robust Context Tracking for Conversational Interleaved Generation [44.19929499646892]
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm. We present ChatUMM, a conversational unified model that excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow.
arXiv Detail & Related papers (2026-02-06T07:11:50Z) - Kling-Omni Technical Report [80.64599716667777]
We present Kling-Omni, a generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks. It supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation.
arXiv Detail & Related papers (2025-12-18T17:08:12Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a "multimodal kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-30T17:59:46Z) - MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation [23.343080324521434]
We introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. Our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources.
arXiv Detail & Related papers (2025-08-26T14:00:16Z) - A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation [8.021435739965982]
We propose a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. In Cognition, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner.
arXiv Detail & Related papers (2025-08-14T09:52:51Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [47.14083940177122]
ThinkSound is a novel framework that enables stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement, and targeted editing. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation [65.06374691172061]
The multimodal-to-speech task has gained increasing attention due to its wide range of applications, such as film production, dubbing, and virtual avatars. Existing methods still suffer from limitations in speech intelligibility, audio-video synchronization, speech naturalness, and voice similarity to the reference speaker. We propose AlignDiT, a multimodal Aligned Diffusion Transformer that generates accurate, synchronized, and natural-sounding speech from aligned multimodal inputs.
arXiv Detail & Related papers (2025-04-29T10:56:24Z) - OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking [22.337906095079198]
We present OmniTalker, a unified framework that jointly generates synchronized talking audio-video content from input text. Our framework adopts a dual-branch diffusion transformer (DiT) architecture, with one branch dedicated to audio generation and the other to video synthesis.
arXiv Detail & Related papers (2025-04-03T09:48:13Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.