MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
- URL: http://arxiv.org/abs/2509.17084v2
- Date: Thu, 25 Sep 2025 15:09:09 GMT
- Title: MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
- Authors: Binhua Huang, Ni Wang, Arjun Pakrashi, Soumyabrata Dev
- Abstract summary: MoCLIP-Lite is a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. Our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines.
- Score: 5.22588980914304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
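The abstract describes the architecture only at a high level: a frozen CLIP image encoder and a frozen, lightweight motion-vector (MV) network feed a small trainable MLP head via late fusion. Below is a minimal sketch of that idea in PyTorch. The MV backbone design, the feature dimensions, the head width, and the use of random tensors in place of real frozen CLIP features are assumptions for illustration, not the paper's actual implementation.

```python
# Hedged sketch of a two-stream late-fusion setup: frozen CLIP features +
# frozen MV features, with only a tiny MLP head trained (as in the abstract).
import torch
import torch.nn as nn

class MVBackbone(nn.Module):
    """Hypothetical lightweight CNN over stacked motion-vector fields
    (e.g. T frames of 2-channel MV maps flattened to 2T input channels)."""
    def __init__(self, in_channels: int = 16, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, mv: torch.Tensor) -> torch.Tensor:
        return self.net(mv)

class LateFusionHead(nn.Module):
    """Tiny MLP over the concatenated frozen CLIP and MV features."""
    def __init__(self, clip_dim: int = 512, mv_dim: int = 256,
                 hidden: int = 512, num_classes: int = 101):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + mv_dim, hidden), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(hidden, num_classes),
        )

    def forward(self, clip_feat: torch.Tensor, mv_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([clip_feat, mv_feat], dim=-1))

# Both backbones stay frozen; only the fusion head would receive gradients.
clip_feat = torch.randn(8, 512)                                # stand-in for frozen CLIP image features
mv_feat = MVBackbone()(torch.randn(8, 16, 56, 56)).detach()    # frozen MV stream output
head = LateFusionHead()
logits = head(clip_feat, mv_feat)                              # (8, 101) class scores for UCF101
```

In training, only `head.parameters()` would be passed to the optimizer, which matches the abstract's claim that fusion requires learning just a small MLP on top of cached, frozen features.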
Related papers
- Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval [57.88666884515147]
We propose One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). OneClip-RAG makes full use of the merits of video clips for augmented video understanding. It is also equipped with a novel query-guided video chunking algorithm.
arXiv Detail & Related papers (2025-12-09T09:40:20Z) - Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model [60.171601995737646]
Mobile-VideoGPT is an efficient multimodal framework for video understanding. It consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler [10.92767902813594]
We introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs.
arXiv Detail & Related papers (2025-01-26T13:10:12Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z) - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z) - MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling [7.737755720567113]
This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
arXiv Detail & Related papers (2023-03-10T05:22:39Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Motion-Focused Contrastive Learning of Video Representations [94.93666741396444]
Motion, as the most distinct phenomenon in a video that involves changes over time, has been unique and critical to the development of video representation learning.
We present a Motion-focused Contrastive Learning (MCL) method that regards such a duet as the foundation.
arXiv Detail & Related papers (2022-01-11T16:15:45Z)