Related papers: SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

URL: http://arxiv.org/abs/2602.03589v1
Date: Tue, 03 Feb 2026 14:39:16 GMT
Title: SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
Authors: Ming Nie, Dan Ding, Chunwei Wang, Yuanfan Guo, Jianhua Han, Hang Xu, Li Zhang,
Abstract summary: Large language models (LLMs) have demonstrated exceptional capabilities in text understanding.<n>Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information.<n>This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding.
Score: 36.28285195488772
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.

Related papers

Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp to create a targeted synthetic temporal dataset to fine-tune the model's responses to encourage it to focus on the given input video.<n>We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form context window limits.<n>We introduce Video-EM, a training free framework inspired by the principles of human memory.<n>Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z)
KFFocus: Highlighting Keyframes for Enhanced Video Understanding [33.69757683688046]
We propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames.<n>By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details.<n>We also introduce a multimodal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame.
arXiv Detail & Related papers (2025-08-12T14:57:03Z)
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval [41.81696346270799]
Current large language models (LMs) struggle with hour-level video understanding.<n>bftextAdaptive textbfPivot MLbfVisual information textbfRetrieval (textbfAPVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information.
arXiv Detail & Related papers (2025-06-05T12:27:10Z)
Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs)<n>Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Muti-modal Large Language Model (MLLM)<n>Our experimental results demonstrate the superiority of the proposed method over existing ones in standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs [56.92627816895305]
Video large language models have achieved remarkable performance in tasks such as video question answering.<n>Our dataset focuses on enhancing temporal comprehension across five key dimensions.<n>We introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets.
arXiv Detail & Related papers (2025-03-13T03:05:11Z)
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling.<n>We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding.<n>Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for Video Reasoning rely heavily on a single special token to represent the object in the video.<n>We propose VRS-HQ, an end-to-end video reasoning segmentation approach.<n>Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.