SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
- URL: http://arxiv.org/abs/2511.20190v1
- Date: Tue, 25 Nov 2025 11:14:39 GMT
- Title: SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA
- Authors: Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang,
- Abstract summary: The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within them. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames. We propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions.
- Score: 44.90805518708208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within them. This task poses significant challenges: models must accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information so that answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.
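The abstract describes the scan-focus-amplify loop only in words; below is a minimal, hedged sketch of what such a training-free pipeline could look like. The helper `ocr_fn`, the lexical relevance scorer, and the `video_llm.generate` interface are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of a scan-focus-amplify style loop for Video TextVQA.
# Assumptions: ocr_fn(frame) -> [(text, (x0, y0, x1, y1))] and the
# video_llm.generate(...) interface are hypothetical stand-ins.
from difflib import SequenceMatcher
from PIL import Image

def relevance(text: str, question: str) -> float:
    # Crude lexical relevance; the paper presumably uses a stronger scorer.
    return SequenceMatcher(None, text.lower(), question.lower()).ratio()

def answer_textvqa(frames, question, ocr_fn, video_llm, top_k=4, zoom=2):
    """frames: list of HxWx3 uint8 arrays sampled from the video."""
    # 1) Scan: look over the sampled frames and collect scene-text regions.
    candidates = []
    for t, frame in enumerate(frames):
        for text, box in ocr_fn(frame):
            candidates.append((relevance(text, question), t, box))
    # 2) Focus: keep only the regions most relevant to the question.
    candidates.sort(key=lambda c: c[0], reverse=True)
    crops = []
    for _, t, (x0, y0, x1, y1) in candidates[:top_k]:
        crop = Image.fromarray(frames[t]).crop((x0, y0, x1, y1))
        # 3) Amplify: enlarge the focused region so small text stays legible.
        crops.append(crop.resize((crop.width * zoom, crop.height * zoom)))
    # Hand the original frames plus the amplified crops to the Video-LLM;
    # only the inputs are re-organized, no model weights are changed.
    return video_llm.generate(frames=frames, focus_images=crops, prompt=question)
```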
Related papers
- Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering [13.757806950813995]
Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. We propose a novel query-guided spatial-temporal-frequency interaction method to enhance audio-visual understanding. Our proposed method achieves significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
arXiv Detail & Related papers (2026-01-27T17:24:32Z) - ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z) - Video-LLMs with Temporal Visual Screening [59.18455762289321]
Temporal Visual Screening (TVS) is a new task that universally pre-processes video question answering and instruction tuning data. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference).
arXiv Detail & Related papers (2025-08-27T14:33:32Z) - FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering [26.585985828583304]
Video question answering (VQA) is a task that requires interpreting a video to answer a given question. We propose a novel approach designed to strengthen the reasoning ability of the model by enhancing its fundamental understanding of videos.
arXiv Detail & Related papers (2025-07-17T06:19:38Z) - Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues [8.797350517975477]
Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning over textual and visual information in a given video. We propose the TEA ("Track the Answer") method, which better extends the generative TextVQA framework from image to video.
arXiv Detail & Related papers (2024-12-17T03:06:12Z) - Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to better integrate multi-modal knowledge. LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the frames most relevant to the questions. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
arXiv Detail & Related papers (2024-12-12T12:39:07Z) - SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant [48.220285886328746]
We introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant.
SQ-LLaVA exhibits proficiency in generating flexible and meaningful image-related questions while analyzing visual clues and prior language knowledge.
Fine-tuning SQ-LLaVA on higher-quality instruction data shows a performance improvement compared with traditional visual-instruction tuning methods.
arXiv Detail & Related papers (2024-03-17T18:42:38Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a rough sketch of this idea appears after the related-papers list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z) - Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
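As a side note on the VaQuitA entry above, the following is a rough, hypothetical sketch of CLIP-score-guided frame selection: frames are ranked by image-text similarity to the question and only the top-k are kept, rather than sampling uniformly. The model name and value of k are arbitrary choices for illustration, not taken from the paper.

```python
# Hypothetical CLIP-score-ranked frame selection (not VaQuitA's actual code).
import torch
from transformers import CLIPModel, CLIPProcessor

def select_frames_by_clip(frames, question, k=8, model_name="openai/clip-vit-base-patch32"):
    """frames: list of PIL.Image; returns the k frames most similar to the question."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[question], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # one similarity score per frame
    keep = scores.topk(min(k, len(frames))).indices.sort().values  # preserve temporal order
    return [frames[i] for i in keep.tolist()]
```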