Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
- URL: http://arxiv.org/abs/2508.09136v1
- Date: Tue, 12 Aug 2025 17:59:46 GMT
- Title: Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
- Authors: Ya Zou, Jingfeng Yao, Siyuan Yu, Shuai Zhang, Wenyu Liu, Xinggang Wang
- Abstract summary: We propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. Our method enables real-time 720p video VAE decoding on mobile devices for the first time. Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro.
- Score: 36.637983575162075
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
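As a rough illustration of why point (1) of the abstract reduces parameters, the sketch below compares the weight count of a standard 3D convolution with that of a depthwise separable one (a per-channel k x k x k depthwise convolution followed by a 1 x 1 x 1 pointwise convolution). The function names and the channel and kernel sizes are assumptions chosen for illustration; they are not taken from the paper, and biases are ignored.

```python
def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard 3D convolution with a cubic k*k*k kernel (no bias)."""
    return c_in * c_out * k ** 3


def depthwise_separable_conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise separable 3D convolution (no bias).

    Depthwise step: one k*k*k filter per input channel.
    Pointwise step: a 1*1*1 convolution mixing c_in channels into c_out.
    """
    depthwise = c_in * k ** 3
    pointwise = c_in * c_out
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer size, not a configuration from the paper.
    c_in, c_out, k = 128, 128, 3
    standard = conv3d_params(c_in, c_out, k)
    separable = depthwise_separable_conv3d_params(c_in, c_out, k)
    print(f"standard:  {standard}")    # 442368
    print(f"separable: {separable}")   # 19840
    print(f"ratio:     {separable / standard:.4f}")
```

For this hypothetical layer the separable variant needs roughly 4.5% of the weights of the standard convolution. The overall figure reported in the abstract (as low as 17.5% of the original parameter count) refers to the whole decoder, not to a single layer.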
Related papers
- Helios: Real Real-Time Long Video Generation Model [33.34372252025333]
Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
arXiv Detail & Related papers (2026-03-04T18:45:21Z)
- MobileViCLIP: An Efficient Video-Text Model for Mobile Devices [24.114050057019078]
This paper presents an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small performs similarly to InternVideo2-L14 and is 6.9% better than InternVideo2-S14 on MSR-VTT.
arXiv Detail & Related papers (2025-08-10T12:01:58Z)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation [72.20660234882594]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z)
- LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models [17.29580459404157]
We propose LeanVAE, a novel and ultra-efficient Video VAE framework. Our model offers up to 50x fewer FLOPs and 44x faster inference speed. Our experiments validate LeanVAE's superiority in video reconstruction and generation.
arXiv Detail & Related papers (2025-03-18T14:58:59Z)
- Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
Video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
- REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost. We argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. We design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images.
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians.
Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs.
As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z)
- Towards Real-time Video Compressive Sensing on Mobile Devices [18.96331666620252]
Video Snapshot Compressive Imaging (SCI) uses a low-speed 2D camera to capture high-speed scenes as snapshot compressed measurements.
We present an effective approach for video SCI reconstruction, dubbed MobileSCI, which can run at real-time speed on mobile devices.
arXiv Detail & Related papers (2024-08-14T13:03:31Z)
- FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization [14.707312504365376]
We introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off.
We show that our model is 3.5x faster than CMT, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2023-03-24T17:58:32Z)
- RT3D: Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices [57.877112704841366]
This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs.
For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobiles.
arXiv Detail & Related papers (2020-07-20T02:05:32Z)