Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
- URL: http://arxiv.org/abs/2509.17743v2
- Date: Tue, 23 Sep 2025 07:31:00 GMT
- Title: Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
- Authors: Chenglin Li, Feng Han, Feng Tao, Ruilin Li, Qianglong Chen, Jingqi Tong, Yin Zhang, Jiaqi Wang,
- Abstract summary: We introduce the FSVisPR framework, an adaptive visual program reasoning approach.<n>It balances fast reasoning for simple queries with slow reasoning for difficult ones.<n>Experiments show that FS-VisPR improves both efficiency and reliability in visual program.
- Score: 36.10720855157895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models' ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, matching the performance of Qwen2.5VL-72B on VideoMME.
Related papers
- Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding [37.682165829414494]
Video-QTR is a lightweight framework that redefines video comprehension as a query-guided reasoning process.<n>We show that Video-QTR achieves state-of-the-art performance while reducing input frame consumption by up to 73%.
arXiv Detail & Related papers (2025-12-10T06:28:00Z) - SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference [49.84148668264725]
We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages.<n>Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks.
arXiv Detail & Related papers (2025-10-20T17:35:47Z) - Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data.<n>TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines.<n> Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference)
arXiv Detail & Related papers (2025-08-27T14:33:32Z) - Visual Program Distillation with Template-Based Augmentation [36.09275994799905]
We propose a low-cost visual program distillation method that requires no human-generated program annotations.<n>With a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs with the added benefit of much faster inference.
arXiv Detail & Related papers (2024-12-11T17:32:21Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z) - Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR)
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs)<n>Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.