NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for
Autonomous Driving Scenario
- URL: http://arxiv.org/abs/2305.14836v2
- Date: Tue, 20 Feb 2024 05:04:58 GMT
- Title: NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for
Autonomous Driving Scenario
- Authors: Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, Yu-Gang Jiang
- Abstract summary: NuScenes-QA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
- Score: 77.14723238359318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel visual question answering (VQA) task in the context of
autonomous driving, aiming to answer natural language questions based on
street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving
scenario presents more challenges. Firstly, the raw visual data are
multi-modal, including images and point clouds captured by camera and LiDAR,
respectively. Secondly, the data are multi-frame due to the continuous,
real-time acquisition. Thirdly, the outdoor scenes exhibit both moving
foreground and static background. Existing VQA benchmarks fail to adequately
address these complexities. To bridge this gap, we propose NuScenes-QA, the
first benchmark for VQA in the autonomous driving scenario, encompassing 34K
visual scenes and 460K question-answer pairs. Specifically, we leverage
existing 3D detection annotations to generate scene graphs and design question
templates manually. Subsequently, the question-answer pairs are generated
programmatically based on these templates. Comprehensive statistics show that
our NuScenes-QA is a balanced large-scale benchmark with diverse question
formats. Built upon it, we develop a series of baselines that employ advanced
3D detection and VQA techniques. Our extensive experiments highlight the
challenges posed by this new task. Code and dataset are available at
https://github.com/qiantianwen/NuScenes-QA.
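As described above, the question-answer pairs are produced programmatically: 3D detection annotations are first organized into scene graphs, and hand-written templates are then instantiated against them. The official generator lives in the repository linked above; the following is only a minimal sketch, assuming a simplified scene representation and a single hypothetical counting template, of how such template-based generation might look in Python.

# Minimal illustrative sketch of template-based QA generation from scene annotations.
# The object fields, the template wording, and the counting logic are assumptions made
# for illustration only; they do not reproduce the official NuScenes-QA generator.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SceneObject:
    category: str   # e.g. "car", "pedestrian"
    status: str     # e.g. "moving", "parked"
    position: str   # e.g. "front", "back left"

# A hand-written question template with named placeholders (hypothetical wording).
COUNT_TEMPLATE = "How many {status} {category}s are there to the {position} of the ego car?"

def generate_count_qa(scene: List[SceneObject]) -> List[Tuple[str, str]]:
    """Instantiate the counting template for each distinct (status, category, position)
    combination in the scene and derive the answer by counting matching objects."""
    qa_pairs = []
    distinct = sorted(set(scene), key=lambda o: (o.status, o.category, o.position))
    for key in distinct:
        question = COUNT_TEMPLATE.format(status=key.status,
                                         category=key.category,
                                         position=key.position)
        answer = str(sum(1 for o in scene if o == key))
        qa_pairs.append((question, answer))
    return qa_pairs

if __name__ == "__main__":
    scene = [
        SceneObject("car", "moving", "front"),
        SceneObject("car", "moving", "front"),
        SceneObject("pedestrian", "standing", "back left"),
    ]
    for q, a in generate_count_qa(scene):
        print(q, "->", a)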
Related papers
- NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous
Driving Datasets using Markup Annotations [0.6827423171182154]
Visual Question Answering (VQA) is one of the most important tasks in autonomous driving.
We introduce a novel dataset annotation technique in which QAs are enclosed within markups.
This dataset empowers the development of vision language models, especially for autonomous driving tasks.
arXiv Detail & Related papers (2023-12-11T12:58:54Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, an order of magnitude more than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z)
- Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an ...
arXiv Detail & Related papers (2023-03-09T06:58:29Z)
- HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images [18.075338835513993]
We introduce a new dataset, HRVQA, which provides 53,512 aerial images of 1024x1024 pixels and 1,070,240 QA pairs.
To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA.
Our method achieves superior performance in comparison to the previous state-of-the-art approaches.
arXiv Detail & Related papers (2023-01-23T14:36:38Z)
- Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets [5.45761450227064]
We propose a new Few-Shot Visual Question Generation (FS-VQG) task and provide a comprehensive benchmark for it.
We evaluate various existing VQG approaches as well as popular few-shot solutions based on meta-learning and self-supervised strategies for the FS-VQG task.
Several important findings emerge from our experiments, shedding light on the limits of current models in few-shot vision and language generation tasks.
arXiv Detail & Related papers (2022-10-13T15:01:15Z)
- From Easy to Hard: Learning Language-guided Curriculum for Visual Question Answering on Remote Sensing Data [27.160303686163164]
Visual question answering (VQA) for remote sensing scenes has great potential in intelligent human-computer interaction systems.
No object annotations are available in RSVQA datasets, which makes it difficult for models to exploit informative region representations.
There are questions with clearly different difficulty levels for each image in the RSVQA task.
A multi-level visual feature learning method is proposed to jointly extract language-guided holistic and regional image features.
arXiv Detail & Related papers (2022-05-06T11:37:00Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)