SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
- URL: http://arxiv.org/abs/2503.18943v2
- Date: Thu, 27 Mar 2025 17:34:06 GMT
- Title: SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
- Authors: Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan
- Abstract summary: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs). We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets.
- Score: 70.84791600974337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.
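For intuition, here is a minimal, hypothetical sketch of how a two-stream SlowFast token aggregation might look in PyTorch. The pathway sizes, pooling factors, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a two-stream SlowFast token aggregation, assuming frame
# features of shape (T, H, W, C) from a frozen vision encoder. All settings
# below are illustrative, not the authors' exact configuration.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_frames: int = 8,
                    fast_pool: int = 4) -> torch.Tensor:
    """frame_feats: (T, H, W, C) -> (N_tokens, C) visual tokens for the LLM."""
    T, H, W, C = frame_feats.shape

    # Slow pathway: a few uniformly sampled frames at full spatial resolution
    # (preserves fine spatial detail).
    idx = torch.linspace(0, T - 1, steps=min(slow_frames, T)).long()
    slow = frame_feats[idx].reshape(-1, C)                  # (slow_frames*H*W, C)

    # Fast pathway: all frames, aggressively pooled spatially
    # (captures long-range temporal context at low token cost).
    fast = frame_feats.permute(0, 3, 1, 2)                  # (T, C, H, W)
    fast = F.adaptive_avg_pool2d(fast, (H // fast_pool, W // fast_pool))
    fast = fast.flatten(2).permute(0, 2, 1).reshape(-1, C)  # (T*(H/p)*(W/p), C)

    # Concatenate both streams into one token-efficient visual sequence.
    return torch.cat([slow, fast], dim=0)
```

With these purely illustrative settings, 128 frames of 24x24 patch features would yield 8*576 + 128*36 = 9,216 visual tokens instead of 128*576 = 73,728, which is the kind of token-budget reduction the abstract is aiming at.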
Related papers
- Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models [90.10322077894033]
We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning.
Our work addresses the challenges in long video comprehension and high-resolution image understanding.
We propose Eagle-Video-110K, a novel dataset that integrates both story-level and clip-level annotations.
arXiv Detail & Related papers (2025-04-21T17:57:28Z)
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax.
We train the model from a language backbone through three progressive stages, each incorporating long videos averaging 420 seconds at 1 fps, thanks to this perceiving efficiency.
With only 0.8M total video-text samples for training, our model outperforms a direct baseline that employs a fixed partitioning strategy by up to 8.72 points in accuracy.
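As a rough illustration of the differentiable partitioning idea mentioned above (not the paper's actual algorithm), a Gumbel-Softmax assignment of frames to a fixed number of cubes could be sketched as follows; all names and shapes are hypothetical.

```python
# Illustrative sketch of Gumbel-Softmax frame-to-cube assignment, only loosely
# inspired by the Quicksviewer summary above; the real cube partitioning is
# more involved. `boundary_logits` and the fixed cube count are hypothetical.
import torch
import torch.nn.functional as F

def assign_frames_to_cubes(frame_feats: torch.Tensor,
                           boundary_logits: torch.Tensor,
                           tau: float = 1.0) -> torch.Tensor:
    """frame_feats: (T, C); boundary_logits: (T, K) scores over K cubes.
    Returns pooled cube features of shape (K, C)."""
    # Hard but differentiable one-hot assignment of each frame to a cube
    # (straight-through estimator keeps gradients flowing to the logits).
    assign = F.gumbel_softmax(boundary_logits, tau=tau, hard=True)  # (T, K)

    # Average the frame features that fall into each cube.
    counts = assign.sum(dim=0).clamp(min=1.0)                       # (K,)
    cubes = assign.t() @ frame_feats / counts.unsqueeze(1)          # (K, C)
    return cubes
```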
arXiv Detail & Related papers (2025-04-21T17:57:21Z) - Slow-Fast Architecture for Video Multi-Modal Large Language Models [42.3957835391319]
Existing methods compress video representations using predefined rules before feeding them into the multi-modal large language model.
We propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details.
Our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation.
arXiv Detail & Related papers (2025-04-02T03:24:58Z) - VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs).
This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark.
We build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z) - Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input [34.50993235961505]
Kangaroo is a powerful Video LMM aimed at addressing the challenges of processing long videos.
A data curation system builds a large-scale dataset with high-quality annotations for vision-language pre-training and instruction tuning.
A curriculum training pipeline gradually increases the resolution and number of input frames to accommodate long videos.
arXiv Detail & Related papers (2024-08-28T05:34:14Z) - LongVILA: Scaling Long-Context Visual Language Models for Long Videos [86.28679075537089]
LongVILA is a full-stack solution for long-context visual-language models.
LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy on a 6,000-frame (more than 1 million tokens) video needle-in-a-haystack task.
arXiv Detail & Related papers (2024-08-19T17:48:08Z)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized with a two-stream SlowFast input design for Video LLMs that aggregates features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
arXiv Detail & Related papers (2024-07-22T17:58:04Z)
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)