CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
- URL: http://arxiv.org/abs/2511.21129v1
- Date: Wed, 26 Nov 2025 07:27:11 GMT
- Title: CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion
- Authors: Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li
- Abstract summary: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
- Score: 62.04833878126661
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold. First, geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Second, enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We therefore propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
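The abstract describes HMCS only at a high level, so the following is a minimal sketch (in PyTorch, not the authors' code) of how per-modality features could be routed, gated, and fused so that any subset of control maps yields a single conditioning signal while missing modalities are simply omitted. All names here (HybridModalityFuser, feat_dim, the per-modality gates) are hypothetical illustrations, not identifiers from the paper.

```python
# Minimal sketch of hybrid multimodal control fusion; assumes all provided
# control maps share the same resolution and frame count. Not the paper's code.
import torch
import torch.nn as nn

MODALITIES = ["depth", "normal", "segmentation", "edge",
              "albedo", "roughness", "metallic"]

class HybridModalityFuser(nn.Module):
    def __init__(self, in_channels: int = 3, feat_dim: int = 64):
        super().__init__()
        # One lightweight encoder per modality, shared across frames.
        self.encoders = nn.ModuleDict({
            m: nn.Sequential(
                nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
            )
            for m in MODALITIES
        })
        # Learned per-modality gates let the model down-weight noisy cues.
        self.gates = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1)) for m in MODALITIES
        })

    def forward(self, controls: dict) -> torch.Tensor:
        """controls maps a modality name to a (B, T, C, H, W) tensor.
        Any subset of MODALITIES may be present; absent ones are skipped,
        which is what makes the conditioning robust to missing inputs."""
        fused, count = None, 0
        for name, video in controls.items():
            b, t, c, h, w = video.shape
            feat = self.encoders[name](video.reshape(b * t, c, h, w))
            feat = torch.sigmoid(self.gates[name]) * feat
            fused = feat if fused is None else fused + feat
            count += 1
        if fused is None:
            raise ValueError("at least one control modality is required")
        # Average so the feature scale does not depend on how many cues were given.
        fused = fused / count
        return fused.reshape(b, t, -1, h, w)

# Example: condition on depth + edges only; the other modalities are simply absent.
fuser = HybridModalityFuser()
controls = {"depth": torch.rand(1, 8, 3, 64, 64), "edge": torch.rand(1, 8, 3, 64, 64)}
cond = fuser(controls)  # shape (1, 8, 64, 64, 64), ready to condition a backbone
```

In the full system described by the abstract, a fused feature like this would then be injected into a video diffusion backbone with temporal layers; the sketch stops at the fusion step.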
Related papers
- LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation [44.62533878314138]
Localized Semantic Alignment (LSA) is a framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Experiments on the nuScenes and KITTI datasets show the effectiveness of our approach.
arXiv Detail & Related papers (2026-02-05T18:21:02Z) - Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model. Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z) - MultiCOIN: Multi-Modal COntrollable Video INbetweening [46.37499813275259]
We introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse representation. We also propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly.
arXiv Detail & Related papers (2025-10-09T17:59:27Z) - Ctrl-VI: Controllable Video Synthesis via Variational Inference [62.79016502243712]
Ctrl-VI is a video synthesis method that generates samples with high controllability for specified elements. We show that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.
arXiv Detail & Related papers (2025-10-09T01:48:16Z) - BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration [56.98981194478512]
We propose a unified framework that handles a broad range of subject-to-video scenarios. We introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos.
arXiv Detail & Related papers (2025-10-01T02:41:11Z) - Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models [34.131515004434846]
We introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient approach for adapting pretrained video diffusion models to conditional generation tasks. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B.
arXiv Detail & Related papers (2025-06-01T12:57:43Z) - MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement [47.064467920954776]
We introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism. Experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-29T17:58:15Z) - Enabling Versatile Controls for Video Diffusion Models [18.131652071161266]
VCtrl is a novel framework designed to enable fine control over pre-trained video diffusion models. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality.
arXiv Detail & Related papers (2025-03-21T09:48:00Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - Video Set Distillation: Information Diversification and Temporal Densification [68.85010825225528]
Video sets have two dimensions of redundancy: within-sample and inter-sample redundancy. We are the first to study Video Set Distillation, which synthesizes optimized video data by addressing within-sample and inter-sample redundancies.
arXiv Detail & Related papers (2024-11-28T05:37:54Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE decoder, maintaining consistency within short sequences; globally, a training-free flow-guided recurrent latent propagation module enhances overall video stability by propagating and fusing latents across the sequence.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)