Related papers: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

URL: http://arxiv.org/abs/2503.23796v2
Date: Tue, 01 Apr 2025 02:33:18 GMT
Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Authors: Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Abstract summary: We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device.
Score: 3.034710104407876
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository (https://github.com/eai-lab/On-device-Sora).
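
The abstract describes TDTM only as merging consecutive tokens along the temporal dimension. The sketch below illustrates that general idea under stated assumptions: tokens from each pair of adjacent frames are averaged before an expensive layer, and each merged frame is repeated afterwards to restore the original length. The averaging rule, the pairwise grouping, and the function names `merge_temporal` and `unmerge_temporal` are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List

# One frame is a list of tokens; one token is a list of channel values.
Frame = List[List[float]]


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames, halving the temporal length.

    Assumption: pairwise averaging stands in for whatever merge rule the
    paper actually uses.
    """
    if len(frames) % 2 != 0:
        raise ValueError("this sketch assumes an even number of frames")
    merged = []
    for t in range(0, len(frames), 2):
        first, second = frames[t], frames[t + 1]
        merged.append(
            [
                [(x + y) / 2.0 for x, y in zip(tok_a, tok_b)]
                for tok_a, tok_b in zip(first, second)
            ]
        )
    return merged


def unmerge_temporal(frames: List[Frame]) -> List[Frame]:
    """Restore the original temporal length by repeating each merged frame."""
    restored = []
    for frame in frames:
        restored.append([list(tok) for tok in frame])
        restored.append([list(tok) for tok in frame])
    return restored


# Four frames, one token per frame, two channels per token.
video_tokens = [[[0.0, 2.0]], [[2.0, 4.0]], [[10.0, 10.0]], [[20.0, 30.0]]]
merged_tokens = merge_temporal(video_tokens)
restored_tokens = unmerge_temporal(merged_tokens)
```

With half as many frames, a layer that processes every token sees half the temporal sequence length, which is the kind of saving the abstract attributes to TDTM.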

Related papers

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
Taming Diffusion Transformer for Real-Time Mobile Video Generation [72.20660234882594]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z)
Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z)
Scaling On-Device GPU Inference for Large Generative Models [5.938112995772544]
ML Drift is an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
arXiv Detail & Related papers (2025-05-01T00:44:13Z)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices [3.034710104407876]
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device.
arXiv Detail & Related papers (2025-02-05T05:42:29Z)
V^3: Viewing Volumetric Videos on Mobiles via Streamable 2D Dynamic Gaussians [53.614560799043545]
V3 (Viewing Volumetric Videos) is a novel approach that enables high-quality mobile rendering through the streaming of dynamic Gaussians. Our key innovation is to view dynamic 3DGS as 2D videos, facilitating the use of hardware video codecs. As the first to stream dynamic Gaussians on mobile devices, our companion player offers users an unprecedented volumetric video experience.
arXiv Detail & Related papers (2024-09-20T16:54:27Z)
RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies. Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks. Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource Constrained IoT Systems [12.427821850039448]
We propose a novel split computing approach based on slimmable ensemble encoders. The key advantage of our design is the ability to adapt computational load and transmitted data size in real-time with minimal overhead and time. Our model outperforms existing solutions in terms of compression efficacy and execution time, especially in the context of weak mobile devices.
arXiv Detail & Related papers (2023-06-22T06:33:12Z)
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions. We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
Perceptron Synthesis Network: Rethinking the Action Scale Variances in Videos [48.57686258913474]
Video action recognition has been partially addressed by the CNNs stacking of fixed-size 3D kernels. We propose to learn the optimal-scale kernels from the data. An action perceptron synthesizer is proposed to generate the kernels from a bag of fixed-size kernels.
arXiv Detail & Related papers (2020-07-22T14:22:29Z)
Real-Time Video Inference on Edge Devices via Adaptive Model Streaming [9.101956442584251]
Real-time video inference on edge devices like mobile phones and drones is challenging due to the high cost of Deep Neural Networks. We present Adaptive Model Streaming (AMS), a new approach to improving performance of efficient lightweight models for video inference on edge devices.
arXiv Detail & Related papers (2020-06-11T17:25:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.