MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
- URL: http://arxiv.org/abs/2511.21475v1
- Date: Wed, 26 Nov 2025 15:09:02 GMT
- Title: MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
- Authors: Shuai Zhang, Bao Tang, Siyuan Yu, Yueting Zhu, Jingfeng Yao, Ya Zou, Shanglin Yuan, Li Yu, Wenyu Liu, Xinggang Wang
- Abstract summary: We propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models.
- Score: 42.00270347221752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core contributions are: (1) We analyze the performance of linear attention modules and softmax attention modules on mobile devices, and propose a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, yielding a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. In the one-step setting, each frame of 720p video is generated in under 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.
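The efficiency trade-off behind the hybrid denoiser comes down to asymptotics: softmax attention costs O(N²·d) in sequence length N because it materializes an N×N score matrix, while kernelized linear attention reorders the computation to O(N·d²). A minimal NumPy sketch of the two (illustrative only; the feature map `phi` and all shapes are assumptions, not the paper's actual implementation):

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard attention: O(N^2 * d). The N x N score matrix dominates
    # memory and compute on long (e.g. high-resolution video) sequences.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def linear_attention(q, k, v, eps=1e-6):
    # Linear attention: O(N * d^2). A positive feature map phi replaces
    # softmax, so phi(K)^T V can be summarized once as a (d, d_v) matrix
    # instead of forming an N x N score matrix.
    phi = lambda x: np.maximum(x, 0.0) + 1.0  # assumed relu(x)+1 feature map
    kq, kk = phi(q), phi(k)
    kv = kk.T @ v                  # (d, d_v) key-value summary
    z = kk.sum(axis=0)             # (d,) normalizer
    return (kq @ kv) / (kq @ z)[:, None].clip(eps)

rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = rng.normal(size=(3, n, d))
print(softmax_attention(q, k, v).shape)  # (64, 16)
print(linear_attention(q, k, v).shape)   # (64, 16)
```

Both paths produce outputs of the same shape; the hybrid design in the paper presumably mixes the two module types so that cheap linear attention carries most layers while softmax attention preserves quality where it matters.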
Related papers
- Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device [90.46496321553843]
We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. Running in only 3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices.
arXiv Detail & Related papers (2026-02-23T18:59:58Z) - MobileViCLIP: An Efficient Video-Text Model for Mobile Devices [24.114050057019078]
This paper presents an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval, MobileViCLIP-Small obtains performance similar to InternVideo2-L14 and 6.9% better than InternVideo2-S14 on MSR-VTT.
arXiv Detail & Related papers (2025-08-10T12:01:58Z) - Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds [91.56929670753226]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z) - CompactFlowNet: Efficient Real-time Optical Flow Estimation on Mobile Devices [19.80162591240214]
We present CompactFlowNet, the first real-time mobile neural network for optical flow prediction. Optical flow serves as a fundamental building block for various video-related tasks, such as video restoration, motion estimation, video stabilization, object tracking, action recognition, and video generation.
arXiv Detail & Related papers (2024-12-17T19:06:12Z) - SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device [61.42406720183769]
We propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds.
arXiv Detail & Related papers (2024-12-13T18:59:56Z) - SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training [77.681908636429]
Text-to-image (T2I) models face several limitations on mobile devices, including large model sizes, slow inference, and low-quality generation. This paper aims to develop an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms.
arXiv Detail & Related papers (2024-12-12T18:59:53Z) - Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling [125.95527079960725]
Transformer-based models have achieved top performance on major video recognition benchmarks.
Video Mobile-Former is the first Transformer-based video model which constrains the computational budget within 1G FLOPs.
arXiv Detail & Related papers (2022-08-25T17:59:00Z)
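Several entries above (e.g. Mobile-O's conditioning projector) lean on depthwise-separable convolutions for mobile efficiency. The idea is to factor a standard k×k convolution into a per-channel spatial pass plus a 1×1 channel-mixing pass, cutting parameters and FLOPs roughly k²-fold. A quick parameter-count sketch (the channel and kernel sizes are arbitrary examples, not any paper's actual configuration):

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: every output channel sees every input channel.
    return c_in * c_out * k * k

def ds_conv_params(c_in, c_out, k):
    # Depthwise (one k x k filter per input channel) + pointwise (1x1 mixing).
    return c_in * k * k + c_in * c_out

print(conv_params(256, 256, 3))     # 589824
print(ds_conv_params(256, 256, 3))  # 67840  (~8.7x fewer parameters)
```

The same factorization applies to FLOPs, which is why the pattern recurs across the mobile architectures listed here.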
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.