Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
- URL: http://arxiv.org/abs/2512.04000v1
- Date: Wed, 03 Dec 2025 17:36:06 GMT
- Title: Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
- Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu
- Abstract summary: We propose a training-free frame selection framework that adapts its strategy based on the query type. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines.
- Score: 21.18266593437182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
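The adaptive dispatch the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the query-type classifier and the per-frame relevance scorer are hypothetical stand-ins for DIG's actual components.

```python
# Sketch of a DIG-style adaptive frame selector (hypothetical API; the
# `is_global` classifier and `relevance` scorer stand in for the paper's
# actual query-type detection and query-aware selection pipeline).
from typing import Callable, List


def select_frames(
    num_frames: int,
    budget: int,
    query: str,
    is_global: Callable[[str], bool],
    relevance: Callable[[str, int], float],
) -> List[int]:
    """Return `budget` frame indices chosen according to the query type.

    Global queries: cheap uniform sampling over the whole video.
    Localized queries: rank frames by a query-relevance score and keep the best.
    """
    if is_global(query):
        # Evenly spaced indices across the full video.
        step = num_frames / budget
        return [int(i * step) for i in range(budget)]
    # Query-aware path: keep the top-`budget` frames by relevance,
    # returned in temporal order.
    ranked = sorted(range(num_frames), key=lambda f: relevance(query, f), reverse=True)
    return sorted(ranked[:budget])


# Toy usage: a keyword heuristic as the classifier and a dummy scorer
# whose relevance peaks around frame 500.
global_keywords = ("summarize", "overall", "genre")
indices = select_frames(
    1000, 8, "summarize the video",
    is_global=lambda q: any(k in q for k in global_keywords),
    relevance=lambda q, f: -abs(f - 500),
)
```

Because the global path never scores individual frames, its cost is independent of video length, which is the efficiency argument the abstract makes for handling the two query types differently.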
Related papers
- Beyond Caption-Based Queries for Video Moment Retrieval [60.31221310786333]
We investigate degradation of VMR methods when trained on caption-based queries but evaluated on search queries. We introduce three benchmarks by modifying the textual queries in three public VMR datasets. Our approach improves performance on search queries by up to 14.82% mAP_m, and by up to 21.83% mAP_m on multi-moment search queries.
arXiv Detail & Related papers (2026-03-02T20:06:41Z)
- When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning [26.489185170468062]
We propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines.
arXiv Detail & Related papers (2026-01-29T03:16:53Z)
- Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models [58.46663983451155]
PixSearch is an end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits search tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization.
arXiv Detail & Related papers (2026-01-27T00:46:08Z)
- Reasoning-enhanced Query Understanding through Decomposition and Interpretation [87.56450566014625]
ReDI is a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. We compiled a large-scale dataset of real-world complex queries from a major search engine. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms.
arXiv Detail & Related papers (2025-09-08T10:58:42Z)
- LOVO: Efficient Complex Object Query in Large-Scale Video Datasets [11.821229903544404]
LOVO is a novel system designed to efficiently handle compLex Object queries in large-scale VideO datasets. Agnostic to user queries, LOVO performs one-time feature extraction using pre-trained visual encoders, generating compact visual embeddings for key frames. During the query phase, LOVO transforms object queries into query embeddings and conducts fast approximate nearest-neighbor searches on the visual embeddings.
arXiv Detail & Related papers (2025-07-18T18:21:43Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- POQD: Performance-Oriented Query Decomposer for Multi-vector retrieval [8.05982973499578]
Performance-Oriented Query Decomposer (POQD) is a novel query decomposition framework for Multi-Vector Retrieval (MVR). POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems.
arXiv Detail & Related papers (2025-05-25T15:31:52Z)
- Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding [23.022070084937603]
We introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics.
arXiv Detail & Related papers (2025-03-17T13:07:34Z)
- Action tube generation by person query matching for spatio-temporal action detection [0.0]
Our method generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip.
arXiv Detail & Related papers (2025-03-17T09:26:06Z)
- Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs [51.33342412699939]
Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs.
Recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries.
We propose an effective Query Instruction Parsing Plugin (QIPP) that captures latent query patterns from code-like query instructions.
arXiv Detail & Related papers (2024-10-27T03:18:52Z)
- Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text. This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning [58.71541261221863]
Join order selection (JOS) is the problem of ordering join operations to minimize total query execution cost.
We present JoinGym, a query optimization environment for bushy reinforcement learning (RL).
Under the hood, JoinGym simulates a query plan's cost by looking up intermediate result cardinalities from a pre-computed dataset.
arXiv Detail & Related papers (2023-07-21T17:00:06Z)
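The JoinGym cost model described above (summing intermediate result cardinalities looked up from a pre-computed table) can be illustrated with a toy example. The cardinality values, the left-deep restriction, and the relation names here are all hypothetical, chosen only to make the lookup-based costing concrete.

```python
# Toy illustration of cardinality-lookup join costing, in the spirit of
# JoinGym: the cost of a join order is the sum of intermediate result
# sizes, read from a pre-computed table keyed by the set of joined
# relations. Relations and numbers below are made up for the example.
from itertools import permutations

# Hypothetical pre-computed cardinalities for relations {A, B, C}.
card = {
    frozenset("AB"): 5_000,
    frozenset("AC"): 200,
    frozenset("BC"): 10_000,
    frozenset("ABC"): 800,
}


def left_deep_cost(order):
    """Sum the intermediate cardinalities of a left-deep join order."""
    joined, cost = {order[0]}, 0
    for rel in order[1:]:
        joined.add(rel)
        cost += card[frozenset(joined)]
    return cost


# Exhaustive search over left-deep orders; an RL agent would instead
# learn to pick joins step by step against this simulated cost.
best = min(permutations("ABC"), key=left_deep_cost)
```

Because every plan's cost reduces to table lookups, episodes run without executing any real queries, which is what makes this kind of environment cheap enough for RL training.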
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.