Watching the News: Towards VideoQA Models that can Read
- URL: http://arxiv.org/abs/2211.05588v2
- Date: Thu, 7 Dec 2023 06:52:21 GMT
- Title: Watching the News: Towards VideoQA Models that can Read
- Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
- Abstract summary: We argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process.
We propose a novel VideoQA task that requires reading and understanding the text in the video.
We introduce the "NewsVideoQA" dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world.
- Score: 40.01623654896573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering methods focus on commonsense reasoning and visual
cognition of objects or persons and their interactions over time. Current
VideoQA approaches ignore the textual information present in the video.
Instead, we argue that textual information is complementary to the action and
provides essential contextualisation cues to the reasoning process. To this
end, we propose a novel VideoQA task that requires reading and understanding
the text in the video. To explore this direction, we focus on news videos and
require QA systems to comprehend and answer questions about the topics
presented by combining visual and textual cues in the video. We introduce the
"NewsVideoQA" dataset that comprises more than 8,600 QA pairs on 3,000+
news videos obtained from diverse news channels from around the world. We
demonstrate the limitations of current Scene Text VQA and VideoQA methods and
propose ways to incorporate scene text information into VideoQA methods.
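As a rough illustration of how scene text can be fed into a question-answering pipeline, the sketch below samples frames from a news clip, runs off-the-shelf OCR on them, and answers a question over the recognized text with an extractive QA model. This is a minimal, hedged baseline sketch, not the paper's proposed method; the file name, frame-sampling rate, and the default QA checkpoint are assumptions.

```python
# Illustrative OCR-plus-extractive-QA baseline for text-based VideoQA.
# NOT the NewsVideoQA method; paths and parameters are hypothetical.
import cv2                          # pip install opencv-python
import pytesseract                  # pip install pytesseract (requires the tesseract binary)
from transformers import pipeline   # pip install transformers


def ocr_video_frames(video_path: str, every_n_frames: int = 30) -> str:
    """Sample frames from the video and concatenate the OCR'd scene text."""
    cap = cv2.VideoCapture(video_path)
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray))
        idx += 1
    cap.release()
    return " ".join(t.strip() for t in texts if t.strip())


if __name__ == "__main__":
    context = ocr_video_frames("news_clip.mp4")   # hypothetical input file
    qa = pipeline("question-answering")           # default extractive QA model
    result = qa(question="Which channel is broadcasting the report?", context=context)
    print(result["answer"], result["score"])
```

A text-only baseline of this kind ignores visual grounding and temporal cues, which is precisely the gap the paper argues VideoQA methods should close by combining visual and textual information.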
Related papers
- Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering [7.429984955853609]
We present Q-ViD, a simple approach for video question answering (video QA).
Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions.
arXiv Detail & Related papers (2024-02-16T13:59:07Z) - Understanding Video Scenes through Text: Insights from Text-based Video Question Answering [40.01623654896573]
This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content.
We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions.
arXiv Detail & Related papers (2023-09-04T06:11:00Z) - Reading Between the Lanes: Text VideoQA on the Road [27.923465943344723]
RoadTextVQA is a new dataset for the task of video question answering (VideoQA) in the context of driver assistance.
RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions.
We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset.
arXiv Detail & Related papers (2023-07-08T10:11:29Z) - Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions [30.650879247687747]
Video captioning aims to convey dynamic scenes from videos in natural language, advancing video understanding.
In this work, we introduce Video ChatCaptioner, an innovative approach for creating more comprehensive video descriptions.
arXiv Detail & Related papers (2023-04-09T12:46:18Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z) - Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z) - NEWSKVQA: Knowledge-Aware News Video Question Answering [5.720640816755851]
We explore a new frontier in video question answering: answering knowledge-based questions in the context of news videos.
We curate a new dataset of 12K news videos spanning 156 hours with 1M multiple-choice question-answer pairs covering 8,263 unique entities.
We propose a novel approach, NEWSKVQA, which performs multi-modal inferencing over textual multiple-choice questions, videos, their transcripts, and a knowledge base.
arXiv Detail & Related papers (2022-02-08T17:31:31Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.