Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective
- URL: http://arxiv.org/abs/2508.04197v1
- Date: Wed, 06 Aug 2025 08:26:36 GMT
- Title: Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective
- Authors: Yan Zhang, Gangyan Zeng, Daiqing Wu, Huawen Shen, Binbin Li, Yu Zhou, Can Ma, Xiaojun Bi
- Abstract summary: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework that suffers from redundant text entities and implicit relation modeling. This paper proposes a novel model termed GAT (Gather and Trace) for the Video TextVQA task.
- Score: 11.379162727778361
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework that suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.
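Based only on the abstract above, the following is a minimal, hedged sketch of the two-stage design it describes: a gathering step that fuses the visual appearance, layout, and textual features of related entities into one representation per text instance, and a tracing step that relates those instance representations across time together with the question. All module names, feature dimensions, and the use of plain Transformer encoders are illustrative assumptions, not the authors' implementation; the real code is in the linked repository.

```python
# Illustrative sketch of a "gather then trace" pipeline (not the authors' code).
import torch
import torch.nn as nn


class InstanceGathering(nn.Module):
    """Fuse visual appearance, layout, and textual features of the OCR entities
    belonging to one text instance into a single representation (assumed design)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.visual_proj = nn.Linear(2048, d_model)   # e.g. region appearance features
        self.layout_proj = nn.Linear(4, d_model)      # bounding-box geometry
        self.text_proj = nn.Linear(300, d_model)      # e.g. word embeddings of OCR tokens
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=1,
        )

    def forward(self, visual, layout, text):
        # visual: (B, N, 2048), layout: (B, N, 4), text: (B, N, 300)
        tokens = self.visual_proj(visual) + self.layout_proj(layout) + self.text_proj(text)
        fused = self.fuse(tokens)           # contextualize the related entities
        return fused.mean(dim=1)            # one vector per gathered instance


class TrajectoryTracing(nn.Module):
    """Relate gathered instances across frames and score candidate answers."""

    def __init__(self, d_model: int = 256, num_answers: int = 5000):
        super().__init__()
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, instance_seq, question_emb):
        # instance_seq: (B, T, d) instance representations over time
        # question_emb: (B, d) pooled question embedding
        joint = torch.cat([question_emb.unsqueeze(1), instance_seq], dim=1)
        traced = self.temporal(joint)
        return self.classifier(traced[:, 0])   # answer logits read from the query slot


if __name__ == "__main__":
    B, N, T, D = 2, 6, 8, 256
    gather = InstanceGathering(D)
    trace = TrajectoryTracing(D)
    # one gathered instance vector per frame (toy random inputs, toy shapes)
    inst = torch.stack(
        [gather(torch.randn(B, N, 2048), torch.randn(B, N, 4), torch.randn(B, N, 300))
         for _ in range(T)], dim=1)
    logits = trace(inst, torch.randn(B, D))
    print(logits.shape)  # torch.Size([2, 5000])
```

In practice the gathering step would operate on OCR tracks extracted from the video rather than random tensors; the toy shapes above are only there to keep the sketch self-contained and runnable.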
Related papers
- Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues [8.797350517975477]
Video text-based visual question answering (Video TextVQA) is a practical task that aims to answer questions by jointly reasoning over the textual and visual information in a given video. We propose the TEA ("Track thE Answer") method, which better extends the generative TextVQA framework from image to video.
arXiv Detail & Related papers (2024-12-17T03:06:12Z) - Vamos: Versatile Action Models for Video Understanding [23.631145570126268]
We propose versatile action models (Vamos), a learning framework powered by a large language model as the "reasoner".
We evaluate Vamos on five benchmarks, Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema, on its capability to model temporal dynamics, encode visual history, and perform reasoning.
arXiv Detail & Related papers (2023-11-22T17:44:24Z) - VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z) - Bi-Calibration Networks for Weakly-Supervised Video Representation Learning [153.54638582696128]
We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN) that novelly couples two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear model protocol on downstream tasks.
arXiv Detail & Related papers (2022-06-21T16:02:12Z) - Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video; a minimal sketch of this text-conditioned pooling idea is given after this list.
arXiv Detail & Related papers (2022-03-28T20:47:37Z) - BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module BridgeFormer is trained to answer the "questions" constructed by the text features via resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
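For the X-Pool entry above, the following is a rough sketch of the text-conditioned frame pooling idea it describes: the text attends over the frames of each video, and similarity is computed against the resulting text-specific video summary. The single-head formulation, dimensions, and temperature are assumptions for illustration, not the paper's implementation.

```python
# Illustrative text-conditioned frame pooling for text-video retrieval (assumed design).
import torch
import torch.nn.functional as F


def text_conditioned_pool(text_emb: torch.Tensor, frame_embs: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """text_emb: (B, D) text embeddings; frame_embs: (C, T, D) frame embeddings
    of C videos with T frames each. Returns a (B, C) text-video similarity matrix."""
    # attention of each text over the frames of every video: (B, C, T)
    attn = torch.softmax(
        torch.einsum("bd,ctd->bct", text_emb, frame_embs) / temperature, dim=-1)
    # text-conditioned aggregated video embeddings: (B, C, D)
    video_agg = torch.einsum("bct,ctd->bcd", attn, frame_embs)
    text_n = F.normalize(text_emb, dim=-1)
    video_n = F.normalize(video_agg, dim=-1)
    # cosine similarity of each text with its own summary of each video
    return torch.einsum("bd,bcd->bc", text_n, video_n)


if __name__ == "__main__":
    sims = text_conditioned_pool(torch.randn(4, 512), torch.randn(6, 8, 512))
    print(sims.shape)  # torch.Size([4, 6])
```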