Reading Between the Lanes: Text VideoQA on the Road
- URL: http://arxiv.org/abs/2307.03948v1
- Date: Sat, 8 Jul 2023 10:11:29 GMT
- Title: Reading Between the Lanes: Text VideoQA on the Road
- Authors: George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas and C.V. Jawahar
- Abstract summary: RoadTextVQA is a new dataset for the task of video question answering (VideoQA) in the context of driver assistance.
RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions.
We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset.
- Score: 27.923465943344723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, as textual cues typically appear only for a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over these cues over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering. The dataset is available at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa
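For a concrete picture of the task setup, here is a minimal sketch of how a RoadTextVQA-style evaluation loop could be scored. The annotation file name, the JSON fields (question_id, video_id, answers), and the exact-match metric are illustrative assumptions, not the dataset's official format or evaluation protocol.

```python
import json
from pathlib import Path


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string comparison."""
    return " ".join(text.lower().strip().split())


def exact_match_accuracy(annotation_file: Path, predictions: dict) -> float:
    """Score predicted answers against ground-truth answers.

    Assumes `annotation_file` is a JSON list of records shaped like
    {"question_id": ..., "video_id": ..., "question": ..., "answers": [...]}
    (a hypothetical layout) and `predictions` maps question_id to a string.
    """
    records = json.loads(annotation_file.read_text())
    correct = 0
    for record in records:
        prediction = normalize(predictions.get(record["question_id"], ""))
        ground_truths = {normalize(answer) for answer in record["answers"]}
        correct += prediction in ground_truths
    return correct / len(records) if records else 0.0


if __name__ == "__main__":
    # Hypothetical predictions keyed by question id.
    preds = {"q_0001": "40 km/h", "q_0002": "exit 12"}
    print(exact_match_accuracy(Path("roadtextvqa_val.json"), preds))
```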
Related papers
- Scene-Text Grounding for Text-Based Video Question Answering [97.1112579979614]
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and reliance on scene-text recognition.
We study Grounded TextVideoQA by forcing models to answer questions and interpret relevant scene-text regions.
arXiv Detail & Related papers (2024-09-22T05:13:11Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- DSText V2: A Comprehensive Video Text Spotting Dataset for Dense and Small Text [46.177941541282756]
We establish a video text reading benchmark, named DSText V2, which focuses on Dense and Small text reading challenges in the video.
Compared with previous datasets, the proposed dataset mainly includes three new challenges.
A high proportion of small text, coupled with blurriness and distortion in the video, brings further challenges.
arXiv Detail & Related papers (2023-11-29T09:13:27Z)
- Understanding Video Scenes through Text: Insights from Text-based Video Question Answering [40.01623654896573]
This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content.
We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions.
arXiv Detail & Related papers (2023-09-04T06:11:00Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- ICDAR 2023 Video Text Reading Competition for Dense and Small Text [61.138557702185274]
We establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in the video.
Compared with previous datasets, the proposed dataset mainly includes three new challenges.
The proposed DSText includes 100 video clips from 12 open scenarios, supporting two tasks (i.e., video text tracking (Task 1) and end-to-end video text spotting (Task 2)).
arXiv Detail & Related papers (2023-04-10T04:20:34Z)
- Watching the News: Towards VideoQA Models that can Read [40.01623654896573]
We argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process.
We propose a novel VideoQA task that requires reading and understanding the text in the video.
We introduce the NewsVideoQA dataset, which comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world.
arXiv Detail & Related papers (2022-11-10T13:58:38Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
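As a rough illustration of the question-guided two-stream idea (not a reimplementation of the STA architecture itself), a minimal sketch in PyTorch might look as follows; the feature dimension, head count, and answer-vocabulary size are arbitrary assumptions.

```python
import torch
import torch.nn as nn


class TwoStreamAttentionSketch(nn.Module):
    """Question-guided attention over a visual stream and a text stream.

    A simplified sketch of the two-stream idea, not the STA model from the paper.
    """

    def __init__(self, dim: int = 256, num_answers: int = 1000):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_answers)  # hypothetical answer vocabulary

    def forward(self, question, video_segments, text_tokens):
        # Question embeddings attend over video-segment features (visual stream).
        vis, _ = self.visual_attn(question, video_segments, video_segments)
        # Question embeddings attend over encoded scene-text features (text stream).
        txt, _ = self.text_attn(question, text_tokens, text_tokens)
        # Pool over question positions and fuse both streams for answer scoring.
        fused = torch.cat([vis.mean(dim=1), txt.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Arbitrary shapes: batch 2, 8 question tokens, 16 video segments, 12 text tokens.
model = TwoStreamAttentionSketch()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 16, 256), torch.randn(2, 12, 256))
print(logits.shape)  # torch.Size([2, 1000])
```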
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
- RoadText-1K: Text Detection & Recognition Dataset for Driving Videos [26.614671477004375]
This paper introduces a new "RoadText-1K" dataset for text in driving videos.
The dataset is 20 times larger than the existing largest dataset for text in videos.
arXiv Detail & Related papers (2020-05-19T14:51:25Z)