Understanding Video Scenes through Text: Insights from Text-based Video
Question Answering
- URL: http://arxiv.org/abs/2309.01380v2
- Date: Mon, 11 Sep 2023 07:01:24 GMT
- Title: Understanding Video Scenes through Text: Insights from Text-based Video
Question Answering
- Authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
- Abstract summary: This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content.
We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions.
- Score: 40.01623654896573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers have extensively studied the field of vision and language,
discovering that both visual and textual content is crucial for understanding
scenes effectively. Particularly, comprehending text in videos holds great
significance, requiring both scene text understanding and temporal reasoning.
This paper focuses on exploring two recently introduced datasets, NewsVideoQA
and M4-ViteVQA, which aim to address video question answering based on textual
content. The NewsVideoQA dataset contains question-answer pairs related to the
text in news videos, while M4-ViteVQA comprises question-answer pairs from
diverse categories like vlogging, traveling, and shopping. We provide an
analysis of the formulation of these datasets on various levels, exploring the
degree of visual understanding and multi-frame comprehension required for
answering the questions. Additionally, the study includes experimentation with
BERT-QA, a text-only model, which demonstrates comparable performance to the
original methods on both datasets, indicating the shortcomings in the
formulation of these datasets. Furthermore, we also look into the domain
adaptation aspect by examining the effectiveness of training on M4-ViteVQA and
evaluating on NewsVideoQA and vice-versa, thereby shedding light on the
challenges and potential benefits of out-of-domain training.
Related papers
- Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images.
Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image.
The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z) - A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z) - Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z) - Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? [131.300931102986]
In real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles.
We propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning.
We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2022-12-31T11:50:32Z) - Understanding ME? Multimodal Evaluation for Fine-grained Visual
Commonsense [98.70218717851665]
It is unclear whether the models really understand the visual scene and underlying commonsense knowledge due to limited evaluation data resources.
We present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge.
We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation.
arXiv Detail & Related papers (2022-11-10T21:44:33Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instance, reduces the influence of background video and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z) - Bridging Vision and Language from the Video-to-Text Perspective: A
Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z) - Multimodal grid features and cell pointers for Scene Text Visual
Question Answering [7.834170106487722]
This paper presents a new model for the task of scene text visual question answering.
It is based on an attention mechanism that attends to multi-modal features conditioned to the question.
Experiments demonstrate competitive performance in two standard datasets.
arXiv Detail & Related papers (2020-06-01T13:17:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.