Video Panels for Long Video Understanding
- URL: http://arxiv.org/abs/2509.23724v1
- Date: Sun, 28 Sep 2025 08:05:55 GMT
- Title: Video Panels for Long Video Understanding
- Authors: Lars Doorenbos, Federico Spurio, Juergen Gall
- Abstract summary: We propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing Video-Language Models.
- Score: 25.560912635941662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. We will make our code available upon acceptance.
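As a rough illustration of the panel idea, here is a minimal sketch that tiles sampled frames into a single grid image. It assumes frames are PIL images and uses a 3x3 grid; the grid size, sampling scheme, and cell resolution are assumptions here, not the paper's published configuration.

```python
from PIL import Image

def make_panel(frames, grid=3, cell_size=(336, 336)):
    """Tile up to grid*grid frames into one panel image, in row-major order."""
    w, h = cell_size
    panel = Image.new("RGB", (grid * w, grid * h))
    for i, frame in enumerate(frames[: grid * grid]):
        row, col = divmod(i, grid)
        panel.paste(frame.resize((w, h)), (col * w, row * h))
    return panel

# Usage: decode a long video, sample 9 frames uniformly, build one panel,
# and pass the panel to the VLM in place of 9 separate frames.
# frames = [Image.open(p) for p in sorted(frame_paths)]
# panel = make_panel(frames, grid=3)
```

Each panel covers nine timesteps at the cost of per-frame resolution, which is the spatial-for-temporal trade-off the abstract describes.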
Related papers
- VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding [9.415923244280542]
VideoBrain is an end-to-end framework that enables Vision-Language Models to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. A hedged sketch of this dual-agent design follows this entry.
arXiv Detail & Related papers (2026-02-04T00:08:35Z)
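The dual-agent split in the VideoBrain entry above can be sketched roughly as follows. The learned sampling policy itself is not reproduced; the function names, the anchor-plus-window heuristic, and the normalization assumption are all illustrative.

```python
import numpy as np

def clip_agent(frame_embs, query_emb, top_k=1):
    """Rank frames by semantic similarity to the query (coarse retrieval)."""
    sims = frame_embs @ query_emb  # embeddings assumed L2-normalized
    return np.argsort(sims)[::-1][:top_k]

def uniform_agent(start, end, num_samples=8):
    """Densely and uniformly sample frame indices within [start, end]."""
    return np.linspace(start, end, num_samples).round().astype(int)

# Combine the two: retrieve a coarse semantic anchor across the whole video,
# then sample densely around it to recover local temporal detail.
# anchor = int(clip_agent(frame_embs, query_emb)[0])
# dense_idx = uniform_agent(max(0, anchor - 16), anchor + 16)
```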
- Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames [70.93346841539626]
We present Temporal Chain of Thought, an inference strategy for video question-answering. We use the VLM itself to iteratively identify and extract the most relevant frames from the video, and demonstrate how leveraging more computation at inference time to select the most relevant context leads to improvements in accuracy. A rough sketch of this selection loop follows this entry.
arXiv Detail & Related papers (2025-07-01T18:39:26Z)
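The iterative frame selection described in the Temporal Chain of Thought entry above might look roughly like the loop below; `vlm_select` is a hypothetical stand-in for prompting the VLM with the question and candidate frames, not the paper's actual interface.

```python
def temporal_cot(frames, question, vlm_select, keep=8):
    """Iteratively shrink a frame pool by asking the VLM what is relevant."""
    pool = list(range(len(frames)))
    while len(pool) > keep:
        # vlm_select is a hypothetical callable: it prompts the VLM with the
        # question and the candidate frames, and returns the indices the VLM
        # judges relevant, most relevant first.
        ranked = vlm_select(question, [frames[i] for i in pool], pool)
        pool = ranked[: max(keep, len(pool) // 2)]  # halve the pool per round
    return pool  # final frame context used to answer the question
```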
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
- Moment Sampling in Video LLMs for Long-Form Video QA [22.638644170177013]
"Moment sampling" is a model-agnostic approach that enables the model to select the most relevant frames according to the context of the question. By focusing on the frames most pertinent to the given question, our method enhances long-form VideoQA performance in Video LLMs. A minimal sketch of this selection step follows this entry.
arXiv Detail & Related papers (2025-06-18T03:23:56Z)
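A minimal sketch of question-conditioned frame selection as summarized in the moment-sampling entry above, assuming precomputed frame and question embeddings from some dual encoder; the actual scoring model and the choice of k are assumptions.

```python
import numpy as np

def moment_sample(frame_embs, question_emb, k=16):
    """Keep the k frames most relevant to the question, in temporal order."""
    scores = frame_embs @ question_emb      # cosine similarity if normalized
    top_k = np.argsort(scores)[::-1][:k]    # most question-relevant frames
    return np.sort(top_k)                   # restore temporal ordering
```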
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks. A generic temporal-encoder stand-in follows this entry.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
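As a generic stand-in for the dedicated temporal encoder named in the STORM entry above, the PyTorch sketch below runs self-attention over per-frame features before they reach the language model; STORM's actual module and its token-reduction strategy are not reproduced here.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Contextualize per-frame features across time before the LLM sees them."""
    def __init__(self, dim=1024, heads=8, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, dim) pooled image-encoder outputs
        return self.encoder(frame_tokens)

# tokens = TemporalEncoder()(torch.randn(1, 64, 1024))  # 64 frames, dim 1024
```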
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream video understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models [20.976319536167512]
We aim to establish an effective solution that enhances the long-context performance of Visual Language Models. We propose Giraffe, which is effectively extended to a context length of 128K. We will open-source the code, data, and models.
arXiv Detail & Related papers (2024-12-17T09:57:21Z)
- Visual Context Window Extension: A New Perspective for Long Video Understanding [45.134271969594614]
We tackle the challenge of long video understanding from the perspective of context windows.
We propose to adapt LMMs for long video understanding tasks by extending the visual context window.
Our method consistently improves the performance as the number of video frames increases. A generic context-extension sketch follows this entry.
arXiv Detail & Related papers (2024-09-30T07:25:16Z)
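One common recipe for extending a context window, which the entry above alludes to, is to interpolate position indices so that more visual tokens fit within the pretrained positional range; whether this paper uses exactly this scheme is an assumption, so treat the sketch below as generic.

```python
import numpy as np

def interpolate_positions(num_visual_tokens, pretrained_window=4096):
    """Rescale position ids so a longer token sequence spans the trained range."""
    scale = min(1.0, pretrained_window / num_visual_tokens)
    return np.arange(num_visual_tokens) * scale  # fractional position ids

# With 16384 visual tokens and a 4096-position window, every position id is
# scaled by 0.25, so the sequence still ends inside the pretrained range.
```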
- MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model can achieve state-of-the-art performances across multiple datasets. A simplified memory-bank sketch follows this entry.
arXiv Detail & Related papers (2024-04-08T17:59:24Z)
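A minimal sketch of an online memory bank in the spirit of the MA-LMM entry above: frames are processed as a stream and past features are kept in a fixed-capacity store. MA-LMM compresses memory by merging similar adjacent features; the oldest-pair merge below is a deliberate simplification.

```python
from collections import deque
import numpy as np

class MemoryBank:
    """Fixed-capacity store of past frame features for online processing."""
    def __init__(self, capacity=64):
        self.bank = deque()
        self.capacity = capacity

    def add(self, feature):
        """feature: 1-D numpy array for one frame."""
        self.bank.append(feature)
        if len(self.bank) > self.capacity:
            # Merge the two oldest features instead of discarding history.
            a, b = self.bank.popleft(), self.bank.popleft()
            self.bank.appendleft((a + b) / 2)

    def read(self):
        return np.stack(list(self.bank))  # (<= capacity, dim) for the LLM
```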
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding. A local-plus-global pooling sketch follows this entry.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
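The local-plus-global encoding summarized in the LongVLM entry above might be sketched as below: pool features per segment for local detail and across the whole video for global context, then concatenate. The segment count and mean pooling are illustrative assumptions, not the paper's exact token merging.

```python
import numpy as np

def local_global_tokens(frame_feats, num_segments=8):
    """frame_feats: (T, D) with T >= num_segments. Returns (num_segments + 1, D)."""
    segments = np.array_split(frame_feats, num_segments)
    local = np.stack([seg.mean(axis=0) for seg in segments])  # per-segment tokens
    global_tok = frame_feats.mean(axis=0, keepdims=True)      # whole-video token
    return np.concatenate([local, global_tok], axis=0)
```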
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.