An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
- URL: http://arxiv.org/abs/2510.12299v1
- Date: Tue, 14 Oct 2025 09:02:22 GMT
- Title: An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
- Authors: Zhi Li, Yanan Wang, Hao Niu, Julio Vizcarra, Masato Taya
- Abstract summary: Multimodal large language models have recently achieved remarkable progress in video question answering. It remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency.
- Score: 4.726627693005334
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal large language models have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.
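To make the evaluation protocol concrete, below is a minimal sketch of such a modality-ablation harness. The names (`VideoQAExample`, `run_mllm`) are hypothetical placeholders, not the authors' released code; a real study would plug in an actual MLLM call plus the VideoMME/LongVideoBench evaluation loops to obtain accuracy, latency, and memory numbers.

```python
# A minimal, hypothetical sketch of a modality-ablation harness like the one
# the abstract describes. `VideoQAExample` and `run_mllm` are placeholders,
# not the authors' code; replace the stub with a real MLLM inference call.
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoQAExample:
    question: str
    subtitles: Optional[str] = None                    # subtitle track, if any
    frames: List[bytes] = field(default_factory=list)  # sampled video frames
    audio: Optional[bytes] = None                      # raw audio signal

def run_mllm(example: VideoQAExample, modalities: List[str]) -> str:
    """Hypothetical model wrapper: only the inputs named in `modalities`
    are passed to the MLLM; returns the predicted answer string."""
    return "stub answer"  # replace with a real inference call

# The single-modality inputs and the full multimodal combination compared
# in the paper.
CONFIGS = [
    ["question"],
    ["question", "subtitles"],
    ["question", "frames"],
    ["question", "audio"],
    ["question", "subtitles", "frames", "audio"],
]

def benchmark(dataset: List[VideoQAExample]) -> None:
    for modalities in CONFIGS:
        start = time.perf_counter()
        for ex in dataset:
            run_mllm(ex, modalities)
        latency = (time.perf_counter() - start) / max(len(dataset), 1)
        # On a GPU, peak memory per configuration could be read with
        # torch.cuda.max_memory_allocated() after resetting the counter.
        print(f"{'+'.join(modalities):38s} avg latency: {latency:.3f}s")

if __name__ == "__main__":
    benchmark([VideoQAExample(question="What happens at the end?")])
```

Structuring the sweep as a list of modality configurations keeps the accuracy/latency/memory trade-off measurable under identical conditions for every input combination.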
Related papers
- VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval [11.519642157641023]
This paper focuses on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. We demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training.
arXiv Detail & Related papers (2026-02-08T19:39:32Z)
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs. In the second stage, the language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
arXiv Detail & Related papers (2025-05-30T17:59:19Z)
- MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios [66.59827827146262]
We introduce the MME-VideoOCR benchmark, which encompasses a comprehensive range of video OCR application scenarios. The benchmark consists of 1,464 videos with varying resolutions, aspect ratios, and durations, along with 2,000 meticulously curated, manually annotated question-answer pairs. We evaluate 18 state-of-the-art MLLMs on MME-VideoOCR, revealing that even the best-performing model (Gemini-2.5 Pro) achieves an accuracy of only 73.7%.
arXiv Detail & Related papers (2025-05-27T15:27:46Z)
- Video Summarization with Large Language Models [41.51242348081083]
We propose a new video summarization framework that leverages the capabilities of recent Large Language Models (LLMs). Our method, dubbed LLM-based Video Summarization (LLMVS), translates video frames into a sequence of captions using a Multimodal Large Language Model (MLLM). Our experimental results demonstrate the superiority of the proposed method over existing ones on standard benchmarks.
arXiv Detail & Related papers (2025-04-15T13:56:14Z)
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version, InternVideo2.5, with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this unique LRC design greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- Understanding Long Videos with Multimodal Language Models [44.78900245769057]
Large Language Models (LLMs) have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z)
- Long Video Understanding with Learnable Retrieval in Video-Language Models [48.3525267216256]
We introduce a learnable retrieval-based video-language model (R-VLM) for efficient long video understanding. Specifically, given a question (query) and a long video, our model identifies and selects the K most relevant video chunks. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance; a schematic sketch of this retrieval step appears after the list.
arXiv Detail & Related papers (2023-12-08T09:48:36Z)
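As referenced in the R-VLM entry above, here is a schematic sketch of query-conditioned chunk retrieval in that spirit. The `embed` function is a hypothetical stand-in (a deterministic hash-seeded random projection used purely for demonstration), not the paper's implementation; a real system would use a trained video/text encoder.

```python
# Schematic sketch of query-conditioned chunk retrieval in the spirit of
# R-VLM. `embed` is a hypothetical stand-in for a learned encoder.
import hashlib
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Hypothetical encoder: maps text (e.g., a chunk transcript) to a
    unit-norm vector. Deterministic hash seeding stands in for training."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Select the K video chunks most similar to the question, so that only
    their tokens reach the LLM (fewer video tokens, less noise)."""
    q = embed(question)
    sims = np.array([embed(c) @ q for c in chunks])  # cosine similarity
    top = np.argsort(-sims)[:k]                      # indices, highest first
    return [chunks[i] for i in top]

print(top_k_chunks("Who scores the goal?",
                   ["crowd shots", "goal replay", "halftime interview"], k=1))
```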