Related papers: SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

URL: http://arxiv.org/abs/2407.15841v1
Date: Mon, 22 Jul 2024 17:58:04 GMT
Title: SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Authors: Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
Abstract summary: We propose a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context. This is realized by using a two-stream SlowFast design of inputs for Video LLMs. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
Score: 51.712700398020075
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
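
The two-stream input design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of average pooling, the uniform frame sampling, and the frame counts are all assumptions made for the example. Only the 24x24 token grid and the 6x downsampling come from the abstract.

```python
import numpy as np


def avg_pool_2d(features: np.ndarray, stride: int) -> np.ndarray:
    """Average-pool (T, H, W, D) features over non-overlapping stride x stride windows."""
    t, h, w, d = features.shape
    if h % stride or w % stride:
        raise ValueError("H and W must be divisible by the pooling stride")
    # Split H and W into (blocks, stride) pairs, then average within each block.
    blocks = features.reshape(t, h // stride, stride, w // stride, stride, d)
    return blocks.mean(axis=(2, 4))


def slowfast_tokens(
    features: np.ndarray,
    num_slow_frames: int,
    slow_stride: int = 1,   # assumption: Slow pathway keeps the full 24x24 grid
    fast_stride: int = 6,   # "downsampling 6x" from the abstract
) -> np.ndarray:
    """Build one token sequence from per-frame features of shape (T, H, W, D)."""
    t, d = features.shape[0], features.shape[-1]

    # Slow pathway: a few uniformly spaced frames, little or no spatial pooling.
    slow_idx = np.linspace(0, t - 1, num_slow_frames).round().astype(int)
    slow = avg_pool_2d(features[slow_idx], slow_stride)

    # Fast pathway: every sampled frame, aggressive spatial pooling.
    fast = avg_pool_2d(features, fast_stride)

    # Flatten each pathway to (num_tokens, D) and concatenate.
    return np.concatenate([slow.reshape(-1, d), fast.reshape(-1, d)], axis=0)


if __name__ == "__main__":
    # Hypothetical input: 16 frames, 24x24 tokens per frame, 8-dim features.
    frames = np.random.rand(16, 24, 24, 8)
    tokens = slowfast_tokens(frames, num_slow_frames=4)
    print(tokens.shape)  # (2560, 8): 4*24*24 slow tokens + 16*4*4 fast tokens
```

With these hypothetical settings the Fast pathway contributes only 256 of the 2560 tokens, which illustrates how coarse pooling lets many frames fit in a fixed token budget while the Slow pathway retains spatial detail.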

Related papers

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time. Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z)
Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs [25.13186579764434]
We introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94x walltime speedup in video processing.
arXiv Detail & Related papers (2025-05-25T14:09:28Z)
Slow-Fast Architecture for Video Multi-Modal Large Language Models [42.3957835391319]
Existing methods compress video representations using predefined rules before feeding them into the multi-modal large language model. We propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation.
arXiv Detail & Related papers (2025-04-02T03:24:58Z)
FastVID: Dynamic Density Pruning for Fast Video Large Language Models [38.267065642416554]
We propose Dynamic Density Pruning for Fast Video LLMs termed FastVID. FastVID partitions videos into temporally ordered segments to preserve temporal structure and applies a density-based token pruning strategy to maintain essential visual information. Our method significantly reduces computational overhead while maintaining temporal and visual integrity.
arXiv Detail & Related papers (2025-03-14T08:33:08Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation [153.46240555355408]
SlowFast-VGen is a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a conditional video diffusion model for the slow learning of world dynamics. We propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop.
arXiv Detail & Related papers (2024-10-30T17:55:52Z)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. We leverage DINOv2 features to remove redundant frames that exhibit high similarity. We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs. Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
Slot-VLM: SlowFast Slots for Video-Language Modeling [39.474247695753725]
Video-Language Models (VLMs) are powered by the advancements in Large Language Models (LLMs). In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question-answering.
arXiv Detail & Related papers (2024-02-20T15:30:09Z)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution [75.79379734567604]
We show that Video Implicit Neural Representation (VideoINR) can be decoded to videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performances with state-of-the-art STVSR methods on common up-sampling scales.
arXiv Detail & Related papers (2022-06-09T17:45:49Z)
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations. Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views. Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.