Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models
- URL: http://arxiv.org/abs/2509.01167v1
- Date: Mon, 01 Sep 2025 06:39:08 GMT
- Title: Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models
- Authors: Hyunjong Ok, Jaeho Lee
- Abstract summary: Multimodal large language models (MLLMs) typically rely on sampling methods guided by vision-language encoders. We show that popular vision encoders critically suffer from their limited capability to identify where the MLLM should look inside the video. Our findings suggest that the development of better keyframe identification techniques may be necessary for efficient video MLLMs.
- Score: 13.389832365304263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal large language models (MLLMs) have led to much progress in video understanding tasks. To avoid the heavy computational cost of processing all frames, these models typically rely on keyframe sampling methods guided by vision-language encoders (e.g., SigLIP). However, it remains unclear whether such encoders can truly identify the most informative frames. In this work, we provide several pieces of empirical evidence revealing that popular vision encoders critically suffer from their limited capability to identify where the MLLM should look inside the video to handle the given textual query appropriately. Our findings suggest that the development of better keyframe identification techniques may be necessary for efficient video MLLMs.
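To make the failure mode concrete, the keyframe-selection setup the paper studies can be sketched as scoring each candidate frame against the text query with a vision-language encoder and keeping the top-k frames. Below is a minimal sketch using a Hugging Face SigLIP checkpoint; the checkpoint name and the top-k budget are illustrative assumptions, not the paper's exact protocol:

```python
import torch
from transformers import AutoProcessor, AutoModel

# Score video frames against a text query with SigLIP and keep the top-k
# frames for the MLLM (checkpoint and budget k are assumptions).
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def select_keyframes(frames, query, k=8):
    """frames: list of PIL images; returns indices of the k best frames."""
    inputs = processor(text=[query], images=frames,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # (num_frames,) query-frame logits
    return scores.topk(min(k, len(frames))).indices.sort().values.tolist()
```

The paper's evidence targets exactly this scoring step: frames that the encoder ranks highly are often not the frames the MLLM actually needs to answer the query.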
Related papers
- An Empirical Study for Representations of Videos in Video Question Answering via MLLMs [4.726627693005334]
Multimodal large language models have recently achieved remarkable progress in video question answering.
It remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency.
arXiv Detail & Related papers (2025-10-14T09:02:22Z)
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework.
In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs.
In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
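A schematic of this two-stage recipe (the caption format, ASR source, and prompt template here are illustrative assumptions, not SiLVR's exact design):

```python
# Hypothetical two-stage pipeline in the spirit of SiLVR: (1) turn the video
# into text via clip captions and speech transcripts, (2) hand the text to a
# strong reasoning LLM.
def video_to_language(clip_captions, asr_transcript):
    """clip_captions: list of (start_s, end_s, caption); asr_transcript: str."""
    lines = [f"[{s:.0f}s-{e:.0f}s] {cap}" for s, e, cap in clip_captions]
    return "Visual events:\n" + "\n".join(lines) + f"\nSpeech: {asr_transcript}"

def answer(question, clip_captions, asr_transcript, reasoning_llm):
    context = video_to_language(clip_captions, asr_transcript)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer step by step."
    return reasoning_llm(prompt)  # any text-only LLM callable
```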
arXiv Detail & Related papers (2025-05-30T17:59:19Z)
- Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs).
Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multimodal Large Language Model (MLLM).
Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
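The architectural idea reduces to a small module that mixes information across frames before tokens reach the LLM. The self-attention variant below is a generic sketch; the layer sizes, and the use of attention rather than whatever temporal mixer STORM actually employs, are assumptions:

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Hypothetical stand-in for a dedicated temporal encoder placed between
    the image encoder and the LLM (all sizes are assumptions)."""
    def __init__(self, dim=1024, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, tokens_per_frame, dim)
        b, f, t, d = frame_tokens.shape
        x = frame_tokens.permute(0, 2, 1, 3).reshape(b * t, f, d)
        x = self.encoder(x)  # attend across the frame axis
        return x.reshape(b, t, f, d).permute(0, 2, 1, 3)
```

Because temporal mixing happens before the LLM, redundant frame tokens can then be pooled or dropped, which is presumably where the token efficiency in the title comes from.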
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Adaptive Keyframe Sampling for Long Video Understanding [75.7837692594814]
This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS).
It is a plug-and-play module that aims to maximize the useful information carried by a fixed number of video tokens.
Experiments on two long video understanding benchmarks validate that AKS improves video QA accuracy by selecting informative frames.
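One way to realize such a budgeted selection is to trade query relevance against temporal coverage when picking frames. The greedy rule below is an illustrative simplification, not AKS's actual algorithm:

```python
import numpy as np

def select_budgeted(scores, k, coverage_weight=0.5):
    """Greedy sketch: pick k frames that score well against the query but
    stay spread out in time. `scores` is a (num_frames,) relevance array;
    the objective and weight are assumptions, not AKS's exact criterion."""
    n, chosen = len(scores), []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            # distance (in frames) to the nearest already-chosen frame
            spread = min((abs(i - j) for j in chosen), default=n) / n
            val = scores[i] + coverage_weight * spread
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return sorted(chosen)
```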
arXiv Detail & Related papers (2025-02-28T17:46:29Z)
- Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning [15.877954360180468]
Training Multimodal Large Language Models (MLLMs) is resource-intensive and constrained by various training limitations.
We propose the Modular-based Visual Contrastive Decoding (MVCD) framework to overcome this obstacle.
Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED).
Our results show consistent improvement in model accuracy and explain which components of our decoding strategy are effective.
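Contrastive decoding in this family typically compares next-token logits computed with and without (or with a corrupted) visual context. The combination rule below is a generic sketch with assumed hyperparameters, not MVCD's exact formulation:

```python
import torch

def contrastive_next_logits(logits_vis, logits_novis, alpha=1.0, beta=0.1):
    """Generic visual contrastive decoding step: amplify tokens whose
    likelihood actually depends on seeing the image, then mask tokens the
    visually conditioned model already finds implausible."""
    contrasted = (1 + alpha) * logits_vis - alpha * logits_novis
    probs = logits_vis.softmax(dim=-1)
    cutoff = beta * probs.max(dim=-1, keepdim=True).values
    return contrasted.masked_fill(probs < cutoff, float("-inf"))
```

Greedy decoding then proceeds from the contrasted logits, e.g. `contrastive_next_logits(l_vis, l_blind).argmax(dim=-1)`.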
arXiv Detail & Related papers (2025-02-17T12:47:00Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
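The Video Perceiver belongs to the family of learned-query cross-attention resamplers that compress a long stream of frame tokens into a short fixed-length sequence for the LLM; a minimal sketch, where all dimensions and the query count are assumptions:

```python
import torch
import torch.nn as nn

class VideoPerceiver(nn.Module):
    """Minimal learned-query cross-attention resampler, in the spirit of
    VaQuitA's trainable Video Perceiver (sizes are assumptions)."""
    def __init__(self, dim=1024, num_queries=32, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens):
        # video_tokens: (batch, num_frame_tokens, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)
        return self.norm(out)
```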
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
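The matching step is a classic assignment problem and can be solved exactly with the Hungarian algorithm; a sketch with SciPy, treating both sides as embedding vectors (the negative-cosine cost is a placeholder assumption):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs, event_embs):
    """Hungarian matching between decomposed instruction phrases and video
    events, as in InsOVER's matching step. Inputs are assumed to be rows of
    L2-normalized embeddings; the cost choice is a placeholder."""
    cost = -np.asarray(instr_embs) @ np.asarray(event_embs).T
    rows, cols = linear_sum_assignment(cost)  # minimizes total cost
    return list(zip(rows.tolist(), cols.tolist()))  # (instruction, event) pairs
```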
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
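As a toy illustration of what "MLM as the common interface" means, a VideoQA instance can be rendered as text with the answer slot masked and filled by the model. The snippet below uses generic BERT, not LAVENDER's code, which additionally conditions on video features:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# A QA instance posed as masked-token prediction.
prompt = "Question: what is the man holding? Answer: [MASK]."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
print(tok.decode([logits[0, pos].argmax().item()]))  # predicted answer token
```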
arXiv Detail & Related papers (2022-06-14T20:43:25Z)