MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
- URL: http://arxiv.org/abs/2508.07312v1
- Date: Sun, 10 Aug 2025 12:01:58 GMT
- Title: MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
- Authors: Min Yang, Zihan Jia, Zhilin Dai, Sheng Guo, Limin Wang
- Abstract summary: This paper presents an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities. In terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small matches InternVideo2-L14 and outperforms InternVideo2-S14 by 6.9% on MSR-VTT.
- Score: 24.114050057019078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient lightweight neural networks are attracting increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale, high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains performance similar to InternVideo2-L14 and 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
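The key architectural idea, temporal structural reparameterization, extends the RepVGG-style trick of merging parallel training-time branches into a single operator at inference. The sketch below is a minimal illustration of that general mechanism; the module name, kernel size, and branch layout are our assumptions, not the paper's exact design:

```python
# Minimal sketch of temporal structural reparameterization: a depthwise
# temporal conv branch and an identity branch run in parallel during
# training, then merge into one conv for inference (illustrative only).
import torch
import torch.nn as nn


class RepTemporalConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise 1D conv mixes information across the time axis only.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.merged = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial, channels, frames)
        if self.merged:
            return self.conv(x)      # single branch after reparameterization
        return self.conv(x) + x      # conv branch + identity branch (training)

    @torch.no_grad()
    def reparameterize(self):
        # Fold the identity branch into the conv kernel: adding a Dirac
        # (one-hot center tap) per channel makes conv(x) == conv_old(x) + x.
        k = self.conv.kernel_size[0]
        identity = torch.zeros_like(self.conv.weight)   # (C, 1, k), depthwise
        identity[:, 0, k // 2] = 1.0
        self.conv.weight += identity
        self.merged = True


# Equivalence check between the training-time and merged inference paths.
m = RepTemporalConv(8)
x = torch.randn(4, 8, 16)
y_train = m(x)
m.reparameterize()
assert torch.allclose(y_train, m(x), atol=1e-5)
```

The closing assertion verifies that the merged conv reproduces the two-branch training-time output, which is what makes the extra temporal branch free at inference.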
Related papers
- Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device [90.46496321553843]
We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. Running in only 3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices.
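As a rough illustration of how a depthwise-separable conditioning projector with per-layer alignment might look, here is a hedged sketch; all module names, dimensions, and the number of aligned generator blocks are assumptions rather than Mobile-O's actual design:

```python
# Hedged sketch of a conditioning projector built from depthwise-separable
# convolutions; shapes, names, and fusion details are illustrative.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class ConditioningProjector(nn.Module):
    """Projects fused vision-language features to the generator's width."""
    def __init__(self, vl_dim: int = 768, gen_dim: int = 320):
        super().__init__()
        self.proj = nn.Sequential(
            DepthwiseSeparableConv(vl_dim, gen_dim),
            nn.GELU(),
            DepthwiseSeparableConv(gen_dim, gen_dim),
        )
        # "Layerwise alignment" is interpreted here as one projection per
        # generator block -- an assumption on our part, not the paper's spec.
        self.align = nn.ModuleList(nn.Linear(gen_dim, gen_dim) for _ in range(4))

    def forward(self, vl_feats: torch.Tensor):
        # vl_feats: (B, vl_dim, H, W) grid of fused vision-language features
        cond = self.proj(vl_feats)                 # (B, gen_dim, H, W)
        tokens = cond.flatten(2).transpose(1, 2)   # (B, H*W, gen_dim)
        return [layer(tokens) for layer in self.align]


cond_per_block = ConditioningProjector()(torch.randn(1, 768, 16, 16))
print(len(cond_per_block), cond_per_block[0].shape)  # 4 torch.Size([1, 256, 320])
```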
arXiv Detail & Related papers (2026-02-23T18:59:58Z)
- MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices [42.00270347221752]
We propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models.
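The abstract's headline number is compressing sampling from 20+ steps to two. The sketch below shows only the generic shape of a few-step sampling loop driven by a distilled x0-predicting denoiser under a simple linear noise schedule; the interface and update rule are illustrative assumptions, not MobileI2V's scheme:

```python
# Generic few-step diffusion sampling loop with a distilled denoiser.
# The denoiser interface (x0 prediction) and the linear re-noising update
# are assumptions for illustration.
import torch


@torch.no_grad()
def few_step_sample(denoiser, image_cond, shape, steps=2):
    # denoiser(x_t, t, cond) -> predicted clean latent x0 (assumed interface)
    x = torch.randn(shape)                      # start from pure noise
    timesteps = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, t_next = timesteps[i], timesteps[i + 1]
        x0_pred = denoiser(x, t, image_cond)    # one network evaluation
        # Re-noise the x0 prediction down to the next (lower) noise level;
        # a DDIM-like update under a simple linear schedule.
        noise = torch.randn_like(x) if t_next > 0 else 0.0
        x = (1 - t_next) * x0_pred + t_next * noise
    return x                                    # clean video latent after 2 steps


dummy = lambda x, t, c: x * 0.0                 # stand-in denoiser for shape checks
video_latent = few_step_sample(dummy, image_cond=None, shape=(1, 4, 8, 32, 32))
print(video_latent.shape)                       # torch.Size([1, 4, 8, 32, 32])
```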
arXiv Detail & Related papers (2025-11-26T15:09:02Z)
- Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices [36.637983575162075]
We propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. Our method enables real-time 720p video VAE decoding on mobile devices for the first time. Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro.
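One plausible ingredient of such a low-cost transfer, sketched here purely as an assumption rather than Turbo-VAED's actual recipe, is distilling the heavy decoder into a small mobile student that reproduces the teacher's reconstructions from the same latents:

```python
# Toy decoder-distillation loop: a small student decoder learns to match a
# large teacher decoder's outputs. Both networks are stand-ins, and the
# training setup is illustrative only.
import torch
import torch.nn as nn

teacher_decoder = nn.Sequential(  # stand-in for a full video-VAE decoder
    nn.Conv3d(4, 64, 3, padding=1), nn.SiLU(), nn.Conv3d(64, 3, 3, padding=1))
student_decoder = nn.Sequential(  # much smaller, mobile-friendly decoder
    nn.Conv3d(4, 16, 3, padding=1), nn.SiLU(), nn.Conv3d(16, 3, 3, padding=1))

opt = torch.optim.AdamW(student_decoder.parameters(), lr=1e-4)
for step in range(10):                          # toy loop; real training is longer
    latents = torch.randn(2, 4, 8, 16, 16)      # (B, C, T, H, W) video latents
    with torch.no_grad():
        target = teacher_decoder(latents)       # teacher reconstruction
    loss = nn.functional.mse_loss(student_decoder(latents), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```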
arXiv Detail & Related papers (2025-08-12T17:59:46Z)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation [72.20660234882594]
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices like smartphones. We propose a series of novel optimizations to significantly accelerate video generation and enable real-time performance on mobile platforms.
arXiv Detail & Related papers (2025-07-17T17:59:10Z)
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model [60.171601995737646]
Mobile-VideoGPT is an efficient multimodal framework for video understanding. It consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second.
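A hedged sketch of the layout the summary describes (dual lightweight visual encoders, an efficient projector, and a small language model); every module size and the fusion scheme below are illustrative assumptions:

```python
# Dual visual encoders feed a projector whose output is prefixed to the
# text embeddings of a small language model. All modules are stand-ins.
import torch
import torch.nn as nn


class DualEncoderVideoLM(nn.Module):
    def __init__(self, dim_a=384, dim_b=256, lm_dim=512):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, dim_a)   # stand-ins for two lightweight
        self.enc_b = nn.Linear(dim_b, dim_b)   # visual encoders
        self.projector = nn.Linear(dim_a + dim_b, lm_dim)  # fuse into LM width
        self.slm = nn.TransformerEncoder(      # stand-in for a small LM
            nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True), 2)

    def forward(self, feats_a, feats_b, text_embeds):
        vis = self.projector(torch.cat([self.enc_a(feats_a),
                                        self.enc_b(feats_b)], dim=-1))
        # Visual tokens act as a prefix to the text tokens.
        return self.slm(torch.cat([vis, text_embeds], dim=1))


out = DualEncoderVideoLM()(torch.randn(1, 16, 384),
                           torch.randn(1, 16, 256),
                           torch.randn(1, 8, 512))
print(out.shape)  # torch.Size([1, 24, 512])
```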
arXiv Detail & Related papers (2025-03-27T17:59:58Z)
- SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding [70.84791600974337]
We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs). We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline. We perform joint video-image training on a carefully curated data mixture of only publicly available datasets.
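The two-stream SlowFast idea keeps a few frames at full spatial detail (slow) while covering all frames at coarse spatial detail (fast). A minimal token-construction sketch, with the stride and pool sizes chosen arbitrarily for illustration:

```python
# Build a SlowFast-style token set from per-frame patch features: the slow
# stream subsamples frames, the fast stream pools spatially. Rates are
# illustrative assumptions, not SF-LLaVA-1.5's exact configuration.
import torch
import torch.nn.functional as F


def slowfast_tokens(frame_feats: torch.Tensor, slow_stride=4, fast_pool=4):
    # frame_feats: (T, H, W, C) patch features from a vision encoder
    T, H, W, C = frame_feats.shape
    # Slow stream: every `slow_stride`-th frame, all spatial tokens kept.
    slow = frame_feats[::slow_stride].reshape(-1, C)
    # Fast stream: every frame, spatially average-pooled to a few tokens.
    fast = F.avg_pool2d(frame_feats.permute(0, 3, 1, 2),
                        kernel_size=fast_pool)           # (T, C, H/p, W/p)
    fast = fast.permute(0, 2, 3, 1).reshape(-1, C)
    return torch.cat([slow, fast], dim=0)                # tokens fed to the LLM


tokens = slowfast_tokens(torch.randn(32, 16, 16, 256))
print(tokens.shape)  # (32/4)*16*16 + 32*4*4 = 2560 tokens -> [2560, 256]
```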
arXiv Detail & Related papers (2025-03-24T17:59:07Z)
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices [73.46317110474064]
MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices.
It comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, and a multimodal vision model pre-trained in the CLIP fashion.
arXiv Detail & Related papers (2023-12-28T08:21:24Z)
- RepViT: Revisiting Mobile CNN From ViT Perspective [67.05569159984691]
Lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs). In this study, we revisit the efficient design of lightweight CNNs from a ViT perspective and emphasize their promising prospects for mobile devices.
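The central design motif, borrowing the ViT separation of token mixing and channel mixing while keeping everything convolutional, can be sketched as below; the block sizes and arrangement are assumptions, not RepViT's exact stage design:

```python
# Simplified RepViT-style mobile block: a depthwise conv plays the role of
# the token mixer (attention) and 1x1 convs form the channel mixer (FFN),
# each wrapped in a residual connection. Illustrative sizes only.
import torch
import torch.nn as nn


class RepViTStyleBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        # Token mixer: depthwise 3x3 conv substitutes for self-attention.
        self.token_mixer = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.BatchNorm2d(dim))
        # Channel mixer: 1x1 convs form the FFN, as in a transformer block.
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, 1), nn.GELU(),
            nn.Conv2d(dim * expansion, dim, 1))

    def forward(self, x):
        x = x + self.token_mixer(x)     # residual around spatial mixing
        x = x + self.channel_mixer(x)   # residual around channel mixing
        return x


y = RepViTStyleBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```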
arXiv Detail & Related papers (2023-07-18T14:24:33Z)
- Rethinking Vision Transformers for MobileNet Size and Speed [58.01406896628446]
We propose a novel supernet with low latency and high parameter efficiency.
We also introduce a novel fine-grained joint search strategy for transformer models.
This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed.
arXiv Detail & Related papers (2022-12-15T18:59:12Z)
- EfficientFormer: Vision Transformers at MobileNet Speed [43.93223983817965]
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks.
ViT-based models are generally several times slower than lightweight convolutional networks.
Recent efforts try to reduce the complexity of ViTs through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory.
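Claims like these ultimately rest on wall-clock measurements. A generic warmed-up latency harness (ours, not the paper's evaluation protocol) for sanity-checking such comparisons on a local device:

```python
# Minimal latency benchmark: warm up the model, then average wall-clock
# time over repeated forward passes. Generic harness, illustrative only.
import time
import torch


@torch.no_grad()
def measure_latency_ms(model, x, warmup=10, iters=50):
    model.eval()
    for _ in range(warmup):            # warm up caches / lazy initialization
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1e3


conv = torch.nn.Conv2d(3, 64, 3, padding=1)
print(f"{measure_latency_ms(conv, torch.randn(1, 3, 224, 224)):.2f} ms")
```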
arXiv Detail & Related papers (2022-06-02T17:51:03Z)