Video-LLMs with Temporal Visual Screening
- URL: http://arxiv.org/abs/2508.21094v2
- Date: Sat, 13 Sep 2025 20:47:32 GMT
- Title: Video-LLMs with Temporal Visual Screening
- Authors: Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji
- Abstract summary: Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
- Score: 59.18455762289321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping any possible answer invariant and consistent. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.
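The adapter formulation above is concrete enough to sketch. Below is a minimal, illustrative Python sketch of a TVS-style front end at inference time; the `score_segment` relevance scorer and `rewrite_query` simplifier are hypothetical stand-ins (the paper does not publish this interface), and the real system would additionally verify answer consistency when rewriting.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def temporal_visual_screening(
    video_path: str,
    query: str,
    segments: List[Segment],
    score_segment: Callable[[str, Segment, str], float],  # hypothetical relevance scorer
    rewrite_query: Callable[[str], str],                  # hypothetical query simplifier
    keep_top_k: int = 1,
) -> Tuple[List[Segment], str]:
    """Front-end adapter: keep only focus-critical segments and rewrite the
    query to its most direct form, leaving any possible answer unchanged."""
    # Rank candidate segments by query-conditioned relevance.
    ranked = sorted(
        segments,
        key=lambda seg: score_segment(video_path, seg, query),
        reverse=True,
    )
    # Keep the top-k segments, restored to temporal order.
    kept = sorted(ranked[:keep_top_k], key=lambda seg: seg.start)
    # Simplify the query while preserving answer consistency.
    return kept, rewrite_query(query)
```

Under this reading, the downstream Video-LLM samples frames only from the kept segments and receives the rewritten query; during instruction tuning, the same screening would pair each training query with its focus-critical clip.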
Related papers
- RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval [99.33724613432922]
We introduce RANKVIDEO, a reasoning-based reranker for video retrieval. RANKVIDEO explicitly reasons over query-video pairs using video content to assess relevance. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework.
arXiv Detail & Related papers (2026-02-02T18:40:37Z)
- SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding [36.30263540665245]
We propose a framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. Experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness.
arXiv Detail & Related papers (2025-10-23T14:55:28Z)
- Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z)
- Emergent Temporal Correspondences from Video Diffusion Transformers [30.83001895223298]
We introduce DiffTrack, the first quantitative framework for analyzing how video diffusion transformers establish temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching. We extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training.
arXiv Detail & Related papers (2025-06-20T17:59:55Z)
- Prompts to Summaries: Zero-Shot Language-Guided Video Summarization [12.200609701777907]
We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer. It converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging. Our pipeline generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. (A minimal sketch of the caption-then-judge selection step follows this entry.)
arXiv Detail & Related papers (2025-06-12T15:23:11Z)
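A minimal sketch of the caption-then-judge selection described in the entry above, assuming a hypothetical `judge` callable (an LLM scorer returning a relevance value in [0, 1]) and precomputed VidLM scene captions; the paper's actual batch-style prompting scheme is not reproduced here.

```python
from typing import Callable, List, Tuple


def select_skim(
    scene_captions: List[Tuple[int, str]],  # (scene_id, VidLM caption) pairs
    judge: Callable[[str, str], float],     # hypothetical LLM scorer -> [0, 1]
    user_prompt: str,
    budget: int = 5,
) -> List[int]:
    """Score each scene caption against the user's prompt, then keep the
    top-scoring scenes in chronological order as the skim."""
    scored = [(judge(user_prompt, caption), scene_id)
              for scene_id, caption in scene_captions]
    scored.sort(reverse=True)  # highest relevance first
    return sorted(scene_id for _, scene_id in scored[:budget])
```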
- Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding [57.26400319795876]
Temporal Video Grounding (TVG) is a core challenge in long-form video understanding. Recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning. We propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning.
arXiv Detail & Related papers (2025-03-17T17:04:20Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from the text to the video domain.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task: Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z)