OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
- URL: http://arxiv.org/abs/2504.10825v1
- Date: Tue, 15 Apr 2025 03:05:46 GMT
- Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
- Authors: Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
- Abstract summary: We propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple types of video visual content in a single diffusion model. Our model supports three key functionalities: text-conditioned video generation, video understanding, and X-conditioned video generation. By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion.
- Score: 48.39161756773205
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple types of video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
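As a rough illustration of the adaptive control strategy described in the abstract, the sketch below assigns each modality a per-task role (generated vs. conditioning) and noises only the generated modalities before a joint denoising pass. This is a minimal sketch under assumed tensor shapes and a toy noise schedule; `assign_roles` and `build_model_input` are hypothetical helpers, not the authors' implementation.

```python
# Hypothetical sketch: every modality (rgb, depth, canny, segmentation) is a
# video tensor in color space; per task each one acts either as a "generate"
# modality (noised and denoised) or a "condition" modality (kept clean).
import torch

MODALITIES = ["rgb", "depth", "canny", "seg"]

def assign_roles(task: str) -> dict[str, str]:
    """Pick a role per modality depending on the task (illustrative only)."""
    if task == "text_to_all":      # text-conditioned generation: everything is generated
        return {m: "generate" for m in MODALITIES}
    if task == "understanding":    # rgb is given, auxiliary modalities are predicted
        return {m: ("condition" if m == "rgb" else "generate") for m in MODALITIES}
    if task == "depth_to_video":   # X-conditioned generation, here X = depth
        return {m: ("condition" if m == "depth" else "generate") for m in MODALITIES}
    raise ValueError(task)

def build_model_input(clean: dict[str, torch.Tensor], roles: dict[str, str], t: torch.Tensor):
    """Noise only the 'generate' modalities; conditioning modalities stay clean."""
    noisy, noise = {}, {}
    for m, x0 in clean.items():
        eps = torch.randn_like(x0)
        if roles[m] == "generate":
            alpha = torch.cos(t * torch.pi / 2).view(-1, 1, 1, 1, 1)  # toy schedule
            noisy[m] = alpha * x0 + (1 - alpha**2).sqrt() * eps
            noise[m] = eps
        else:
            noisy[m] = x0                      # clean latent acts as the condition
            noise[m] = torch.zeros_like(x0)
    # one joint sequence: all modalities are concatenated and denoised together
    return torch.cat([noisy[m] for m in MODALITIES], dim=1), noise

if __name__ == "__main__":
    clean = {m: torch.randn(2, 4, 8, 32, 32) for m in MODALITIES}  # (B, C, T, H, W)
    x_in, _ = build_model_input(clean, assign_roles("understanding"), torch.rand(2))
    print(x_in.shape)  # torch.Size([2, 16, 8, 32, 32])
```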
Related papers
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer. We present Divot-Vicuna for video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z)
- Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow. We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
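A toy sketch of the mechanism summarized above, assuming stand-in stubs for the sampler (`denoise_step`), decoder (`decode_frames`), and discriminator: learnable prompt-token embeddings are updated at each reverse step using the gradient of a frame-pair discriminator score. All names and shapes are illustrative, not MotionPrompt's actual implementation.

```python
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * 64 * 64, 1))  # frame-pair critic (stub)
disc.requires_grad_(False)                    # frozen: it only supplies gradients to the prompt
prompt_tokens = nn.Parameter(torch.randn(1, 8, 768))  # learnable prompt-token embeddings
opt = torch.optim.Adam([prompt_tokens], lr=1e-3)

def denoise_step(x_t, tokens, t):             # stand-in for one reverse-diffusion step
    return x_t - 0.01 * (x_t + tokens.mean())

def decode_frames(latent):                    # stand-in for the latent -> pixel decoder
    return torch.tanh(latent)

x_t = torch.randn(1, 16, 3, 64, 64)           # (B, T, C, H, W) video being sampled
for t in reversed(range(50)):
    x_t = denoise_step(x_t, prompt_tokens, t)
    i, j = torch.randint(0, 16, (2,))         # pick a random pair of frames
    pair = torch.cat([decode_frames(x_t[:, i]), decode_frames(x_t[:, j])], dim=1)
    loss = -disc(pair).mean()                 # nudge the prompt toward "plausible motion"
    opt.zero_grad()
    loss.backward()
    opt.step()
    x_t = x_t.detach()                        # keep the graph local to the current step
```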
arXiv Detail & Related papers (2024-11-23T12:26:52Z)
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation [14.28357169715152]
We introduce a novel multi-modal latent diffusion model (MM-LDM) for sounding video generation.
We first unify the representation of audio and video data by converting them into a single or a couple of images.
Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space.
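A minimal, hypothetical sketch of that hierarchy: each modality keeps its own low-level perceptual autoencoder, while one shared head projects either latent into a common high-level semantic space. Layer sizes and module names are assumptions, not MM-LDM's released architecture.

```python
import torch
import torch.nn as nn

class ModalityAE(nn.Module):
    """Per-modality low-level (perceptual) encoder/decoder."""
    def __init__(self, in_ch: int, latent_ch: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, latent_ch, 4, 2, 1))
        self.dec = nn.Sequential(nn.ConvTranspose2d(latent_ch, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, in_ch, 4, 2, 1))

class HierarchicalMMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_ae = ModalityAE(in_ch=3)   # video frames as RGB images
        self.audio_ae = ModalityAE(in_ch=1)   # audio rendered as a spectrogram image
        # shared high-level semantic space: one head applied to either latent
        self.semantic = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 128))

    def forward(self, video_img, audio_img):
        zv, za = self.video_ae.enc(video_img), self.audio_ae.enc(audio_img)
        recon_v, recon_a = self.video_ae.dec(zv), self.audio_ae.dec(za)
        sem_v, sem_a = self.semantic(zv), self.semantic(za)  # shared semantic features
        return (recon_v, recon_a), (zv, za), (sem_v, sem_a)

if __name__ == "__main__":
    model = HierarchicalMMAE()
    (rv, ra), _, (sv, sa) = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
    print(rv.shape, ra.shape, sv.shape, sa.shape)
```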
arXiv Detail & Related papers (2024-10-02T14:32:24Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method, generating a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a single pair of original and processed videos rather than a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
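A rough sketch of the single-video training recipe summarized above (a stand-in, not the released Deep Video Prior code): a small CNN is fit on one pair of original/processed videos, and re-rendering every frame through that single network tends to suppress frame-to-frame flicker.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

original  = torch.rand(30, 3, 64, 64)   # the input video (T, C, H, W)
processed = torch.rand(30, 3, 64, 64)   # per-frame processed video (possibly flickering)

for epoch in range(25):                  # a handful of epochs over the single video
    for t in torch.randperm(original.shape[0]).tolist():
        pred = net(original[t:t+1])
        loss = nn.functional.l1_loss(pred, processed[t:t+1])
        opt.zero_grad(); loss.backward(); opt.step()

# the re-rendered video: the same network applied to every frame
with torch.no_grad():
    consistent = torch.cat([net(original[t:t+1]) for t in range(original.shape[0])])
print(consistent.shape)  # torch.Size([30, 3, 64, 64])
```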
arXiv Detail & Related papers (2022-01-27T16:38:52Z)