Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
- URL: http://arxiv.org/abs/2602.08820v1
- Date: Mon, 09 Feb 2026 15:56:05 GMT
- Title: Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing
- Authors: Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, Hao Li,
- Abstract summary: We present a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions that interpret user instructions. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation.
- Score: 21.525921468472685
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions that interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale Omni-Video 2 up to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.
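The abstract's central mechanism lends itself to a compact illustration: the MLLM rewrites the user instruction into an explicit target caption, and a small trainable adapter projects the MLLM's multimodal tokens into the conditioning space of a frozen text-to-video diffusion backbone. The sketch below is a minimal, hypothetical rendering of that flow; the module names, interfaces (`describe_target`, `sample`, `extra_cond`), and dimensions are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the conditioning pipeline described in the abstract.
# All class and method names are illustrative placeholders, not the paper's API.
import torch
import torch.nn as nn


class ConditionAdapter(nn.Module):
    """Lightweight projector from MLLM hidden states to the diffusion model's conditioning space."""

    def __init__(self, mllm_dim: int, cond_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, mllm_tokens: torch.Tensor) -> torch.Tensor:
        # mllm_tokens: (batch, seq_len, mllm_dim) multimodal hidden states from the MLLM
        return self.proj(mllm_tokens)  # -> (batch, seq_len, cond_dim)


def edit_video(mllm, diffusion, adapter, source_video, instruction):
    """Instruction -> explicit target caption + conditional tokens -> conditioned denoising."""
    # 1) The MLLM interprets the instruction in the context of the source video and
    #    emits an explicit target caption plus its hidden states (assumed interface).
    target_caption, hidden_states = mllm.describe_target(source_video, instruction)

    # 2) Only the adapter is trained; the text-to-video diffusion backbone stays frozen
    #    so its generative prior is reused in a parameter-efficient way.
    cond_tokens = adapter(hidden_states)

    # 3) The frozen diffusion model denoises with the injected conditional tokens
    #    (and, for editing tasks, the source video as reference).
    return diffusion.sample(text=target_caption, extra_cond=cond_tokens,
                            reference_video=source_video)
```

Under these assumptions, only the adapter parameters are updated during training, which is what would make reuse of the 14B diffusion prior parameter-efficient.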
Related papers
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions. It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z) - LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding [23.207637210563504]
LiViBench is an omnimodal benchmark for interactive livestream videos. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. We develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams.
arXiv Detail & Related papers (2026-01-21T14:14:20Z) - RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation [19.127189099122244]
We introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single step. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states.
arXiv Detail & Related papers (2025-11-06T12:42:03Z) - Omni-Video: Democratizing Unified Video Understanding and Generation [13.616454543808798]
This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, as well as instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements.
arXiv Detail & Related papers (2025-07-08T16:02:16Z) - Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs [6.300563383392837]
The exponential growth of short-video content has ignited a surge in the necessity for efficient, automated solutions to video editing. We propose an innovative end-to-end foundational framework, ultimately actualizing precise control over the final video content editing.
arXiv Detail & Related papers (2025-01-10T11:35:43Z) - Mimir: Improving Video Diffusion Models for Precise Text Understanding [53.72393225042688]
Text serves as the key control signal in video generation due to its narrative nature. The recent success of large language models (LLMs) showcases the power of decoder-only transformers. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser.
arXiv Detail & Related papers (2024-12-04T07:26:44Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [74.01707548681405]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive and versatile capabilities in video-to-paragraph generation and video content editing, and it can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)