Slot-VLM: SlowFast Slots for Video-Language Modeling
- URL: http://arxiv.org/abs/2402.13088v1
- Date: Tue, 20 Feb 2024 15:30:09 GMT
- Title: Slot-VLM: SlowFast Slots for Video-Language Modeling
- Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
- Abstract summary: Video-Language Models (VLMs) are powered by the advancements in Large Language Models (LLMs).
In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens.
Our experimental results demonstrate the effectiveness of our Slot-VLM, which achieves state-of-the-art performance on video question-answering.
- Score: 39.474247695753725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-Language Models (VLMs), powered by the advancements in Large Language
Models (LLMs), are charting new frontiers in video understanding. A pivotal
challenge is the development of an efficient method to encapsulate video
content into a set of representative tokens to align with LLMs. In this work,
we introduce Slot-VLM, a novel framework designed to generate semantically
decomposed video tokens, in terms of object-wise and event-wise visual
representations, to facilitate LLM inference. Particularly, we design a
SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense
video tokens from the CLIP vision encoder to a set of representative slots. In
order to take into account both the spatial object details and the varied
temporal dynamics, SF-Slots is built with a dual-branch structure. The
Slow-Slots branch focuses on extracting object-centric slots from features at
high spatial resolution but low (slow) frame sample rate, emphasizing detailed
object information. Conversely, the Fast-Slots branch is engineered to learn
event-centric slots from high temporal sample rate but low spatial resolution
features. These complementary slots are combined to form the vision context,
serving as the input to the LLM for efficient question answering. Our
experimental results demonstrate the effectiveness of our Slot-VLM, which
achieves state-of-the-art performance on video question-answering.
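As a rough illustration of the dual-branch design described in the abstract, the following is a minimal PyTorch sketch of how dense CLIP patch features could be aggregated into object-centric and event-centric slots with a standard slot-attention block. The module names, strides, and slot counts are illustrative assumptions, not the paper's released code.

```python
# Minimal PyTorch sketch of a dual-branch "SlowFast slots" aggregator.
# All module and parameter names are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Standard slot-attention block (Locatello et al., 2020), lightly simplified."""
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, D) dense visual tokens
        B = x.size(0)
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_mu.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)   # slots compete for tokens
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = attn @ v                                           # (B, S, D)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view_as(slots)
        return slots                             # (B, S, D) object/event-level tokens

class SlowFastSlots(nn.Module):
    """Slow branch: few frames, full spatial tokens. Fast branch: all frames, pooled tokens."""
    def __init__(self, dim, slow_slots=8, fast_slots=8, slow_stride=8, fast_pool=4):
        super().__init__()
        self.slow_stride, self.fast_pool = slow_stride, fast_pool
        self.slow = SlotAttention(slow_slots, dim)
        self.fast = SlotAttention(fast_slots, dim)

    def forward(self, feats):                    # feats: (B, T, H, W, D) CLIP patch features
        B, T, H, W, D = feats.shape
        slow_in = feats[:, ::self.slow_stride].reshape(B, -1, D)          # sparse frames, dense space
        fast_in = F.avg_pool2d(                                           # dense frames, coarse space
            feats.reshape(B * T, H, W, D).permute(0, 3, 1, 2), self.fast_pool
        ).flatten(2).transpose(1, 2).reshape(B, -1, D)
        object_slots = self.slow(slow_in)        # object-centric context
        event_slots = self.fast(fast_in)         # event-centric context
        return torch.cat([object_slots, event_slots], dim=1)  # vision context fed to the LLM
```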
Related papers
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [23.188508465235717]
We propose two strategies to enhance the model's capability in video understanding tasks.
The first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE.
The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask.
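For intuition, a frame-wise block causal attention mask can be built so that tokens within one frame attend to each other freely while attention across frames remains causal. The helper below is a minimal sketch of that idea, not TC-LLaVA's implementation.

```python
# Illustrative construction of a frame-wise block causal attention mask:
# tokens inside one frame attend to each other bidirectionally, attention
# across frames stays causal. Names are illustrative, not TC-LLaVA's code.
import torch

def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Return a boolean mask of shape (L, L); True = attention allowed."""
    L = num_frames * tokens_per_frame
    frame_id = torch.arange(L) // tokens_per_frame            # frame index of each token
    allowed = frame_id.unsqueeze(0) <= frame_id.unsqueeze(1)  # key frame <= query frame
    return allowed                                            # block-wise lower triangular

# Example: 3 frames, 2 tokens per frame -> 6x6 mask with 2x2 all-True diagonal blocks.
print(frame_block_causal_mask(3, 2).int())
```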
arXiv Detail & Related papers (2024-09-05T02:54:17Z) - SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models [51.712700398020075]
We propose a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context.
This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
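A minimal sketch of such a two-stream aggregation, assuming precomputed per-frame patch features: the slow stream keeps spatial detail from a few frames, while the fast stream keeps every frame but pools it spatially. Strides and pooling sizes are illustrative, not SF-LLaVA's exact settings.

```python
# Training-free two-stream (SlowFast) token aggregation for a video LLM prompt.
# Parameter values are illustrative, not SF-LLaVA's exact configuration.
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor, slow_stride: int = 8, fast_pool: int = 4):
    """frame_feats: (T, H, W, D) per-frame patch features from a frozen image encoder."""
    T, H, W, D = frame_feats.shape
    # Slow stream: sample every `slow_stride`-th frame, keep all patch tokens.
    slow = frame_feats[::slow_stride].reshape(-1, D)                  # (T/stride * H*W, D)
    # Fast stream: keep every frame, average-pool patches down to coarse tokens.
    fast = F.avg_pool2d(frame_feats.permute(0, 3, 1, 2), fast_pool)   # (T, D, H/p, W/p)
    fast = fast.flatten(2).transpose(1, 2).reshape(-1, D)             # (T * H*W/p^2, D)
    return torch.cat([slow, fast], dim=0)                             # visual tokens for the LLM

tokens = slowfast_tokens(torch.randn(32, 24, 24, 1024))
print(tokens.shape)   # torch.Size([3456, 1024]) for these strides
```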
arXiv Detail & Related papers (2024-07-22T17:58:04Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
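One way to picture this is a locate-then-answer loop: ask the model for a region of interest, crop it from the original-resolution image, and answer using both views. The `vlm.locate` and `vlm.answer` calls below are hypothetical placeholders used only to sketch the control flow, not the paper's interface.

```python
# Sketch of an "attend to a region of interest, then re-ask" loop in the spirit
# of Chain-of-Spot. The `vlm` object and its methods are hypothetical.
from PIL import Image

def chain_of_spot(vlm, image: Image.Image, question: str):
    # Step 1: assumed helper that returns a normalized box (x0, y0, x1, y1) in [0, 1]
    # marking the region most relevant to the question.
    box = vlm.locate(image, question)
    # Step 2: crop that region from the original-resolution image, so the details
    # of interest are never globally downsampled.
    W, H = image.size
    crop = image.crop((box[0] * W, box[1] * H, box[2] * W, box[3] * H))
    # Step 3: answer using both the global view and the zoomed-in crop.
    return vlm.answer(images=[image, crop], question=question)
```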
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs [55.8550939439138]
Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems.
These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions.
We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM.
Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
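Conceptually, such a positional insert can be sketched as a single learnable tensor added to the frozen visual features, with everything else kept frozen. Shapes and the point of insertion below are assumptions for illustration only.

```python
# Input-agnostic learnable spatial prompt added to frozen visual features.
# Shapes and insertion point are illustrative assumptions.
import torch
import torch.nn as nn

class PositionalInsert(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        # The only trainable parameters: one learned vector per patch position.
        self.pin = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, num_patches, dim) from a frozen vision encoder.
        return visual_feats + self.pin   # input-agnostic: same prompt for every image

# Only the PIN is optimized, using the VLM's ordinary next-token prediction loss;
# the vision encoder and LLM stay frozen.
pin = PositionalInsert(num_patches=576, dim=1024)
optimizer = torch.optim.AdamW(pin.parameters(), lr=1e-3)
```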
arXiv Detail & Related papers (2024-02-13T18:39:18Z) - TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing the aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
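A minimal sketch of the reweighting idea: frames whose ground-truth masks change sharply between steps receive a larger loss weight. The IoU-based deformation score used below is an illustrative choice, not necessarily the paper's exact measure.

```python
# Transformation-aware loss reweighting sketch: upweight frames where the
# ground-truth mask deforms strongly between consecutive steps.
import torch
import torch.nn.functional as F

def transformation_aware_loss(logits, gt_masks, eps: float = 1e-6):
    """logits, gt_masks: (T, H, W); gt_masks is a binary {0., 1.} float mask."""
    inter = (gt_masks[1:] * gt_masks[:-1]).sum(dim=(1, 2))
    union = ((gt_masks[1:] + gt_masks[:-1]) > 0).float().sum(dim=(1, 2))
    deform = 1.0 - inter / (union + eps)                      # (T-1,) deformation per step
    weights = torch.cat([deform.new_ones(1), 1.0 + deform])   # larger weight on deforming frames
    per_frame = F.binary_cross_entropy_with_logits(
        logits, gt_masks, reduction="none").mean(dim=(1, 2))  # (T,)
    return (weights * per_frame).sum() / weights.sum()
```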
arXiv Detail & Related papers (2023-12-13T21:02:03Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
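A minimal sketch of CLIP-score-guided frame selection: score each candidate frame against the question text and keep the top-k in temporal order. The open_clip-based scoring below is an illustrative choice; VaQuitA's actual pipeline may differ.

```python
# Rank candidate frames by CLIP similarity to the question and keep the top-k,
# instead of sampling frames uniformly. Model choice is illustrative.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def select_frames(frames, question: str, k: int = 8):
    """frames: list of PIL images; returns the k frames most relevant to the question."""
    images = torch.stack([preprocess(f) for f in frames])
    img_emb = torch.nn.functional.normalize(model.encode_image(images), dim=-1)
    txt_emb = torch.nn.functional.normalize(model.encode_text(tokenizer([question])), dim=-1)
    scores = (img_emb @ txt_emb.T).squeeze(-1)                       # CLIP similarity per frame
    keep = scores.topk(min(k, len(frames))).indices.sort().values    # preserve temporal order
    return [frames[i] for i in keep.tolist()]
```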
arXiv Detail & Related papers (2023-12-04T19:48:02Z)