Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM
- URL: http://arxiv.org/abs/2505.19901v3
- Date: Tue, 03 Jun 2025 07:32:50 GMT
- Title: Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM
- Authors: Peng Liu, Xiaoming Ren, Fengkai Liu, Qingsong Xie, Quanlong Zheng, Yanhao Zhang, Haonan Lu, Yujiu Yang
- Abstract summary: We present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. We show that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively.
- Score: 34.23005193437701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in image-to-video (I2V) generation have shown promising performance in conventional scenarios. However, these methods still encounter significant challenges when dealing with complex scenes that require a deep understanding of nuanced motion and intricate object-action relationships. To address these challenges, we present Dynamic-I2V, an innovative framework that integrates Multimodal Large Language Models (MLLMs) to jointly encode visual and textual conditions for a diffusion transformer (DiT) architecture. By leveraging the advanced multimodal understanding capabilities of MLLMs, our model significantly improves motion controllability and temporal coherence in synthesized videos. The inherent multimodality of Dynamic-I2V further enables flexible support for diverse conditional inputs, extending its applicability to various downstream generation tasks. Through systematic analysis, we identify a critical limitation in current I2V benchmarks: a significant bias towards favoring low-dynamic videos, stemming from an inadequate balance between motion complexity and visual quality metrics. To resolve this evaluation gap, we propose DIVE - a novel assessment benchmark specifically designed for comprehensive dynamic quality measurement in I2V generation. In conclusion, extensive quantitative and qualitative experiments confirm that Dynamic-I2V attains state-of-the-art performance in image-to-video generation, particularly revealing significant improvements of 42.5%, 7.9%, and 11.8% in dynamic range, controllability, and quality, respectively, as assessed by the DIVE metric in comparison to existing methods.
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Toward Rich Video Human-Motion2D Generation [16.58311138197227]
We introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences. Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions. We propose a novel diffusion-based rich video human motion2D generation model (RVHM2D).
arXiv Detail & Related papers (2025-06-17T11:45:33Z) - Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source data curation augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (NVIDIA-L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z) - Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance [70.12690940725092]
Adaptive low-pass guidance (ALG) is a simple fix to the I2V model sampling procedure to generate more dynamic videos. Under the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
arXiv Detail & Related papers (2025-06-10T05:23:46Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos. We advocate for the incorporation of a retrieval mechanism during the generation phase. Our pipeline is designed to apply to any text-to-video diffusion model.
arXiv Detail & Related papers (2025-04-09T08:14:05Z) - Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think [24.308538128761985]
Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. We propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time.
arXiv Detail & Related papers (2025-03-02T16:06:16Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping [45.584695790489484]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks. We introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity.
arXiv Detail & Related papers (2024-10-11T10:33:51Z) - Evaluation of Text-to-Video Generation Models: A Dynamics Perspective [94.2662603491163]
Existing evaluation protocols primarily focus on temporal consistency and content continuity.
We propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models.
arXiv Detail & Related papers (2024-07-01T08:51:22Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
It calls for events-to-video (E2V) solutions which take event-streams as input and generate high quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.