Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
- URL: http://arxiv.org/abs/2503.13139v1
- Date: Mon, 17 Mar 2025 13:07:34 GMT
- Title: Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
- Authors: Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong,
- Abstract summary: We introduce a semantics-driven search framework that reformulates selection under the paradigm of Visual Semantic-Logical Search.<n>Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics.
- Score: 23.022070084937603
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to ``finding a needle in a haystack.'' To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
Related papers
- Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment [0.0]
Long Video Question Answering (LVQA) is challenging due to the need for temporal reasoning and large-scale multimodal data processing.<n>We introduce UMaT, a retrieval-augmented generation framework that efficiently processes extremely long videos.<n>We show that UMaT outperforms existing methods in multimodal integration, long-form video understanding, and sparse information retrieval.
arXiv Detail & Related papers (2025-03-12T05:28:24Z) - QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models.<n>QuoTA strategically allocates frame-level importance scores based on query relevance.<n>We decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z) - HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding [14.464718780172582]
We introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling.<n>We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding.<n>Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance.
arXiv Detail & Related papers (2025-03-11T16:21:23Z) - Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.<n>Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z) - VidCtx: Context-aware Video Question Answering with Image Models [15.1350316858766]
We introduce VidCtx, a novel training-free VideoQA framework which integrates both visual information from input frames and textual descriptions of others frames.<n>Experiments show that VidCtx achieves competitive performance among approaches that rely on open models.
arXiv Detail & Related papers (2024-12-23T09:26:38Z) - Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z) - Temporal Query Networks for Fine-grained Video Understanding [88.9877174286279]
We cast this into a query-response mechanism, where each query addresses a particular question, and has its own response label set.
We evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
arXiv Detail & Related papers (2021-04-19T17:58:48Z) - Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval [98.62404433761432]
The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems.
Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries.
We propose a Tree-augmented Cross-modal.
method by jointly learning the linguistic structure of queries and the temporal representation of videos.
arXiv Detail & Related papers (2020-07-06T02:50:27Z) - Video Monitoring Queries [16.7214343633499]
We study the problem of interactive declarative query processing on video streams.
We introduce a set of approximate filters to speed up queries that involve objects of specific type.
The filters are able to assess quickly if the query predicates are true to proceed with further analysis of the frame.
arXiv Detail & Related papers (2020-02-24T20:53:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.