Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
- URL: http://arxiv.org/abs/2512.09354v1
- Date: Wed, 10 Dec 2025 06:28:00 GMT
- Title: Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
- Authors: Xinkui Zhao, Zuxin Wang, Yifan Zhang, Guanjie Cheng, Yueshen Xu, Shuiguang Deng, Chang Liu, Naibo Wang, Jianwei Yin
- Abstract summary: Video-QTR is a lightweight framework that redefines video comprehension as a query-guided reasoning process. We show that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%.
- Score: 37.682165829414494
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development of multimodal large language models (MLLMs) has significantly expanded the scope of visual-language reasoning, enabling unified systems to interpret and describe complex visual content. However, applying these models to long-video understanding remains computationally intensive. Dense frame encoding generates excessive visual tokens, leading to high memory consumption, redundant computation, and limited scalability in real-world applications. This inefficiency highlights a key limitation of the traditional process-then-reason paradigm, which analyzes visual streams exhaustively before semantic reasoning. To address this challenge, we introduce Video-QTR (Query-Driven Temporal Reasoning), a lightweight framework that redefines video comprehension as a query-guided reasoning process. Instead of encoding every frame, Video-QTR dynamically allocates perceptual resources based on the semantic intent of the query, creating an adaptive feedback loop between reasoning and perception. Extensive experiments across five benchmarks, including MSVD-QA, ActivityNet-QA, MovieChat, and Video-MME, demonstrate that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%. These results confirm that query-driven temporal reasoning provides an efficient and scalable solution for video understanding.
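The abstract describes an adaptive feedback loop between reasoning and perception but gives no implementation details. The following is a minimal sketch of what such a query-driven loop could look like; every name here (`video.sample`, `mllm.reason`, the confidence threshold, the frame budget) is a hypothetical placeholder rather than the authors' API.

```python
# Hypothetical sketch of a query-driven perception loop in the spirit of
# Video-QTR: frames are sampled only where the evolving answer needs them,
# instead of densely encoding the whole video up front.

def query_driven_reasoning(video, query, mllm, max_rounds=4, budget=64):
    """Iteratively request frames until the model is confident or the
    frame budget is exhausted. `mllm.reason` is assumed to return an
    answer, a confidence score, and the time spans it wants to inspect
    next."""
    seen_frames = []
    # Start from a single sparse glance at the whole video.
    spans = [(0.0, video.duration)]
    answer = None
    for _ in range(max_rounds):
        for start, end in spans:
            # Sample a handful of frames per requested span.
            seen_frames += video.sample(start, end, n=8)
            if len(seen_frames) >= budget:
                break
        answer, confidence, spans = mllm.reason(query, seen_frames)
        if confidence > 0.9 or not spans or len(seen_frames) >= budget:
            break
    return answer
```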
Related papers
- TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding [14.570869250170139]
TV-RAG is a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning.
arXiv Detail & Related papers (2025-12-29T14:10:22Z)
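The abstract names entropy-guided semantic weighting but not its exact form. Below is a rough, assumed illustration of one way entropy could re-weight segment retrieval scores; the combination rule and `alpha` are invented for this sketch, not taken from TV-RAG.

```python
import math

# Rough illustration (not TV-RAG's actual algorithm) of entropy-weighted
# retrieval: segments whose caption tokens are less predictable carry more
# information, so their similarity to the query is up-weighted.

def semantic_entropy(token_probs):
    """Shannon entropy of a segment's token probability distribution."""
    return -sum(p * math.log(p + 1e-12) for p in token_probs)

def rank_segments(query_sim, segment_token_probs, alpha=0.5):
    """query_sim[i]: cosine similarity of the query to segment i.
    segment_token_probs[i]: token probabilities for segment i's text.
    Returns segment indices sorted by entropy-weighted relevance."""
    scores = []
    for sim, probs in zip(query_sim, segment_token_probs):
        scores.append(sim * (1.0 + alpha * semantic_entropy(probs)))
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```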
- FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding [55.700832127331324]
FLoC is an efficient visual token compression framework based on the facility location function. Our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens. Our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution.
arXiv Detail & Related papers (2025-10-31T17:29:39Z)
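The facility location function named in the title is a standard submodular objective, and greedy maximization is its textbook solver. The sketch below shows that classical greedy selection over a token similarity matrix; FLoC's own formulation may differ in its details.

```python
# Sketch of greedy facility-location selection. Given a token-token
# similarity matrix, pick k "representative" tokens so that every token
# is close to at least one selected token:
#   F(S) = sum_i max_{j in S} sim[i][j]

def facility_location_select(sim, k):
    """sim[i][j]: similarity between tokens i and j. Returns k indices."""
    n = len(sim)
    selected = []
    best_cover = [0.0] * n  # best similarity of each token to the set so far
    for _ in range(k):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding token j to the selected set.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected
```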
- Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
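The abstract does not describe how Vgent's graph is built, so the sketch below is purely hypothetical: clips become nodes, clips sharing detected entities are linked, and retrieval expands from the best-matching clips to their neighbors for extra context.

```python
from collections import defaultdict

# Hypothetical sketch of graph-based clip retrieval (not Vgent's actual
# construction): nodes are clips, edges link clips with shared entities.

def build_clip_graph(clip_entities):
    """clip_entities[i] is a set of entity names detected in clip i."""
    graph = defaultdict(set)
    for i, ents_i in enumerate(clip_entities):
        for j in range(i + 1, len(clip_entities)):
            if ents_i & clip_entities[j]:  # shared entity -> edge
                graph[i].add(j)
                graph[j].add(i)
    return graph

def retrieve_with_neighbors(query_scores, graph, top_k=2):
    """Take the top-k clips by query similarity plus their graph neighbors."""
    seeds = sorted(range(len(query_scores)),
                   key=lambda i: -query_scores[i])[:top_k]
    hits = set(seeds)
    for s in seeds:
        hits |= graph[s]
    return sorted(hits)
```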
- Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding [33.58579390725519]
Video-MTR is a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning over multiple turns. To supervise the intermediate reasoning process, we introduce a novel gated bi-level reward system.
arXiv Detail & Related papers (2025-08-28T06:55:08Z)
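The abstract mentions a gated bi-level reward without defining it. One plausible reading, sketched below with invented weights, is a turn-level segment-relevance reward that is only credited when the final answer is correct.

```python
# Illustrative sketch (not the authors' code) of a gated bi-level reward
# for multi-turn video reasoning.

def gated_bilevel_reward(selected_segments, relevant_segments,
                         answer_correct, turn_weight=0.3):
    """Outer reward: final-answer correctness. Inner reward: segment
    relevance, gated by the outer reward so the policy cannot farm
    turn-level rewards while answering incorrectly."""
    outer = 1.0 if answer_correct else 0.0
    overlap = len(set(selected_segments) & set(relevant_segments))
    inner = overlap / max(len(relevant_segments), 1)
    return outer + (turn_weight * inner if answer_correct else 0.0)
```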
- Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context-window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over the respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z)
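The abstract does not specify Video-EM's memory structure. A common training-free pattern consistent with "episodic memory" is to segment the video into episodes at visual boundaries and replay only relevant episodes; the sketch below assumes that pattern, with the threshold invented.

```python
# Assumed sketch of episodic segmentation (not Video-EM's actual design):
# a new episode starts whenever consecutive frame features diverge.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / (norm + 1e-12)

def build_episodes(frame_features, boundary_threshold=0.7):
    """frame_features[i]: feature vector of frame i (assumed non-empty).
    Returns a list of episodes, each a list of frame indices."""
    episodes, current = [], [0]
    for i in range(1, len(frame_features)):
        if cosine(frame_features[i - 1], frame_features[i]) < boundary_threshold:
            episodes.append(current)
            current = []
        current.append(i)
    episodes.append(current)
    return episodes
```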
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
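The abstract states that a lightweight intent classifier adapts retrieval to query complexity. The routing below is an assumed illustration; the label set, index names, and `top_k` are invented rather than taken from the paper.

```python
# Rough illustration of complexity-adaptive retrieval routing in the
# spirit of AdaVideoRAG: simple queries hit only the cheap caption index,
# complex ones consult progressively richer indexes.

def route_query(query, intent_classifier, indexes):
    """`intent_classifier` is assumed to return one of 'simple',
    'moderate', or 'complex'. `indexes` maps a source name to an object
    with a `search(query, top_k)` method."""
    intent = intent_classifier(query)
    if intent == "simple":
        sources = ["captions"]
    elif intent == "moderate":
        sources = ["captions", "asr", "ocr"]
    else:  # 'complex': also consult visual features and the semantic graph
        sources = ["captions", "asr", "ocr", "visual", "graph"]
    results = []
    for name in sources:
        results += indexes[name].search(query, top_k=5)
    return results
```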
- SiLVR: A Simple Language-based Video Reasoning Framework [71.77141065418238]
We present SiLVR, a Simple Language-based Video Reasoning framework. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks.
arXiv Detail & Related papers (2025-05-30T17:59:19Z)
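SiLVR's two-stage recipe is stated directly in the abstract and is simple enough to sketch end to end; the `captioner`, `asr`, and `llm` callables and the clip length below are placeholders for whatever models and settings one plugs in.

```python
# Minimal sketch of SiLVR's two-stage recipe as described in the abstract:
# stage 1 turns the video into text, stage 2 hands that text to a
# reasoning LLM.

def silvr_pipeline(video, question, captioner, asr, llm):
    # Stage 1: language-based representation from multisensory inputs.
    descriptions = []
    for clip in video.split(seconds=10):
        descriptions.append(f"[{clip.start:.0f}s] {captioner(clip)}")
    transcript = asr(video.audio)

    # Stage 2: a strong text-only LLM reasons over the descriptions.
    prompt = (
        "Video clip descriptions:\n" + "\n".join(descriptions)
        + f"\n\nSpeech transcript:\n{transcript}"
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```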
- REVEAL: Relation-based Video Representation Learning for Video-Question-Answering [14.867263291053968]
We propose RElation-based rEpresentAtion Learning (REVEAL) to capture visual relation information. Inspired by spatio-temporal scene graphs, we encode video sequences as sets of relation triplets in the form of (subject-predicate-object) over time via their language embeddings. We evaluate the proposed framework on five challenging benchmarks: NExT-QA, Intent-QA, STAR, VLEP, and TVQA.
arXiv Detail & Related papers (2025-04-07T19:54:04Z)
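The abstract says videos are encoded as timed (subject-predicate-object) triplets via language embeddings. The sketch below shows that verbalize-then-embed step; `embed_text` stands in for any sentence-embedding model, and the `Triplet` type is an assumption of this sketch.

```python
from dataclasses import dataclass

# Illustrative sketch (not REVEAL's implementation) of encoding a video
# as timed relation triplets embedded with a text encoder.

@dataclass
class Triplet:
    time: float       # seconds into the video
    subject: str
    predicate: str
    obj: str

def encode_triplets(triplets, embed_text):
    """Verbalize each (subject, predicate, object) triplet and embed it,
    keeping the timestamp so temporal order is preserved downstream."""
    encoded = []
    for t in sorted(triplets, key=lambda t: t.time):
        sentence = f"{t.subject} {t.predicate} {t.obj}"
        encoded.append((t.time, embed_text(sentence)))
    return encoded
```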
- HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former-based framework that sequentially processes frames to bypass the need for frame sampling. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z)
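The abstract names a language-guided feature modulator without detail. A FiLM-style conditioning layer, sketched below with hypothetical projection matrices, is one standard way to realize such modulation; HierarQ's actual design may differ.

```python
import numpy as np

# Loose sketch of FiLM-style language-guided modulation (not HierarQ's
# exact modulator): scale and shift frame features using vectors
# predicted from the task/query embedding.

def language_guided_modulation(frame_feats, query_emb, w_gamma, w_beta):
    """frame_feats: (T, D) frame features processed sequentially.
    query_emb: (Q,) embedding of the task/query text.
    w_gamma, w_beta: (Q, D) projection matrices (hypothetical parameters)."""
    gamma = query_emb @ w_gamma   # (D,) per-channel scale
    beta = query_emb @ w_beta     # (D,) per-channel shift
    return frame_feats * (1.0 + gamma) + beta
```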
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the video LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
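The abstract pins down only the placement of STORM's temporal encoder: between the image encoder and the language model. The module below respects that placement, but its internals (temporal self-attention followed by average pooling) are invented for illustration.

```python
import torch.nn as nn

# Conceptual sketch of a temporal encoder sitting between an image
# encoder and a video LLM; the compressor design here is illustrative,
# not STORM's actual architecture.

class TemporalEncoder(nn.Module):
    """Mixes information across frames, then pools to cut token count."""
    def __init__(self, dim, pool=4):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                batch_first=True)
        self.pool = pool

    def forward(self, x):          # x: (batch, frames, dim)
        x = self.mixer(x)          # temporal self-attention across frames
        b, t, d = x.shape
        t = t - t % self.pool      # drop remainder frames for clean pooling
        x = x[:, :t].reshape(b, t // self.pool, self.pool, d).mean(2)
        return x                   # (batch, frames/pool, dim): fewer tokens
```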
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.