UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- URL: http://arxiv.org/abs/2512.07831v1
- Date: Mon, 08 Dec 2025 18:59:01 GMT
- Title: UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
- Authors: Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia
- Abstract summary: We introduce UnityVideo, a unified framework for world-aware video generation. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints.
- Score: 61.98887854225878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent video generation models demonstrate impressive synthesis capabilities but remain limited by single-modality conditioning, constraining their holistic world understanding. This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities (segmentation masks, human skeletons, DensePose, optical flow, and depth maps) and training paradigms. Our approach features two core components: (1) dynamic noising to unify heterogeneous training paradigms, and (2) a modality switcher with an in-context learner that enables unified processing via modular parameters and contextual learning. We contribute a large-scale unified dataset with 1.3M samples. Through joint optimization, UnityVideo accelerates convergence and significantly enhances zero-shot generalization to unseen data. We demonstrate that UnityVideo achieves superior video quality, consistency, and improved alignment with physical world constraints. Code and data can be found at: https://github.com/dvlab-research/UnityVideo
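The abstract's first component, "dynamic noising," is described only at a high level; the paper itself should be consulted for the real formulation. As a rough illustration of the general idea (sampling an independent noise level per modality so that generation, conditioning, and in-context paradigms share one training step), here is a minimal sketch. Every name and design choice below is an assumption for illustration, not UnityVideo's actual implementation:

```python
import torch


def dynamic_noising(latents: dict[str, torch.Tensor]):
    """Hypothetical sketch, NOT UnityVideo's actual method: sample an
    independent noise level per modality, so a single batch can mix a
    heavily noised stream (acting as a generation target) with lightly
    noised or clean streams (acting as conditions)."""
    noised, timesteps = {}, {}
    for name, x in latents.items():
        t = torch.rand(x.shape[0])                    # per-sample noise level in [0, 1)
        noise = torch.randn_like(x)
        t_b = t.view(-1, *([1] * (x.dim() - 1)))      # broadcast t over latent dims
        # Linear interpolation between clean latent and noise
        # (flow-matching-style schedule, chosen here for simplicity).
        noised[name] = (1.0 - t_b) * x + t_b * noise
        timesteps[name] = t
    return noised, timesteps


# Example: two modality streams of video latents with independent noise levels.
batch = {
    "video": torch.randn(2, 4, 8, 8),
    "depth": torch.randn(2, 4, 8, 8),
}
noised, ts = dynamic_noising(batch)
```

Because each modality draws its own noise level, the same forward pass can treat any subset of streams as conditions (low noise) and the rest as targets (high noise), which is one plausible reading of how a single objective could unify heterogeneous training paradigms.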
Related papers
- Kling-Omni Technical Report [80.64599716667777]
We present Kling-Omni, a generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks. It supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation.
arXiv Detail & Related papers (2025-12-18T17:08:12Z)
- Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements.
arXiv Detail & Related papers (2025-07-08T16:02:16Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [46.76487873983082]
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action.
We train our model from scratch on a large multimodal pre-training corpus from diverse sources.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark.
arXiv Detail & Related papers (2023-12-28T17:57:06Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.