LingoQA: Visual Question Answering for Autonomous Driving
- URL: http://arxiv.org/abs/2312.14115v4
- Date: Thu, 26 Sep 2024 15:30:00 GMT
- Title: LingoQA: Visual Question Answering for Autonomous Driving
- Authors: Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski
- Abstract summary: We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving.
The dataset contains 28K unique short video scenarios and 419K annotations.
On our benchmark, GPT-4V responds truthfully to 59.6% of the questions, compared to 96.6% for humans.
- Score: 14.620546951115328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
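To make the headline evaluation metric concrete, the sketch below shows how agreement between an automatic judge and human evaluations can be quantified with a Spearman rank correlation, the statistic reported for Lingo-Judge. The per-model scores are invented placeholders; this is a minimal illustration of the comparison, not the Lingo-Judge implementation itself.
```python
# Minimal sketch: rank-correlating an automatic metric with human evaluation.
# The scores below are invented placeholders, not LingoQA results.
from scipy.stats import spearmanr

# Hypothetical per-model truthfulness rates assigned by human annotators (%)
human_scores = [96.6, 59.6, 48.2, 41.7, 35.0]

# Hypothetical scores produced by an automatic judge for the same models
judge_scores = [0.95, 0.61, 0.47, 0.44, 0.33]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
# A coefficient close to 1.0 (LingoQA reports 0.95 for Lingo-Judge) means the
# automatic judge ranks models almost exactly as human evaluators do.
```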
Related papers
- Unlock the Power of Unlabeled Data in Language Driving Model [23.648749606793118]
We build a strong Language Driving Model (LDM) for driving scene question-answering, outperforming previous state-of-the-art methods.
Our LDM achieves 44.85% performance with limited labeled data, increasing to 54.27% when using unlabeled data, while models trained with full datasets reach 60.68% on the DriveLM benchmark.
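The summary does not spell out how the unlabeled data is exploited; a common recipe is self-training with pseudo-labels, sketched below under that assumption. The `model` object, its `answer` method, and the confidence threshold are hypothetical placeholders, not the paper's actual pipeline.
```python
# Minimal self-training sketch (an assumption; the paper's actual semi-supervised
# recipe for the Language Driving Model may differ).

def pseudo_label(model, unlabeled_scenes, questions, min_confidence=0.9):
    """Generate pseudo question-answer labels for unlabeled driving scenes.

    `model.answer` is a hypothetical API returning (answer_text, confidence).
    """
    pseudo_labeled = []
    for scene in unlabeled_scenes:
        for question in questions:
            answer, confidence = model.answer(scene, question)
            # Keep only confident predictions to limit label noise.
            if confidence >= min_confidence:
                pseudo_labeled.append(
                    {"scene": scene, "question": question, "answer": answer}
                )
    return pseudo_labeled

# Typical loop: train on the labeled split, pseudo-label the unlabeled pool,
# then retrain on the union of real and pseudo labels.
```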
arXiv Detail & Related papers (2025-03-13T17:36:36Z) - Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving [10.01820885669991]
We propose a novel evaluation method: Safety Cognitive Driving Benchmark (SCD-Bench).
To address the large-scale annotation challenge for SCD-Bench, we develop the Autonomous Driving Image-Text System.
Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition.
arXiv Detail & Related papers (2025-03-09T07:53:19Z) - Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models [67.62126108440003]
We introduce Vibe-Eval: a new open benchmark and framework for evaluating multimodal chat models.
Vibe-Eval consists of 269 visual understanding prompts, including 100 of hard difficulty, complete with gold-standard responses authored by experts.
We discuss trade-offs between human and automatic evaluation, and show that automatic model evaluation using Reka Core roughly correlates with human judgment.
arXiv Detail & Related papers (2024-05-03T17:59:55Z) - Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
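A minimal sketch of turning a label hierarchy into automatically generated follow-up questions is given below; the class names and hierarchy are toy examples, not the benchmark's actual label space.
```python
# Toy sketch: deriving a follow-up question from a semantic label hierarchy.
# The hierarchy and labels are illustrative, not the benchmark's taxonomy.
HIERARCHY = {
    "animal": {"dog": ["labrador", "poodle"], "cat": ["siamese", "persian"]},
    "vehicle": {"car": ["sedan", "suv"], "truck": ["pickup", "semi"]},
}

def follow_up_question(coarse_answer: str) -> str | None:
    """If the model answered with a coarse category, ask about its subcategories."""
    for parent, children in HIERARCHY.items():
        if coarse_answer == parent:
            return f"Which kind of {parent} is it: {', '.join(children)}?"
        for child, leaves in children.items():
            if coarse_answer == child:
                return f"Which kind of {child} is it: {', '.join(leaves)}?"
    return None  # Answer is already at leaf level or outside the hierarchy.

print(follow_up_question("dog"))  # -> "Which kind of dog is it: labrador, poodle?"
```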
arXiv Detail & Related papers (2024-02-11T18:26:18Z) - AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering [6.088350050879401]
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering.
The comprehensiveness of AutoEval-Video is demonstrated by its open-ended video questions, which span 9 skill dimensions and address capabilities of perception, comprehension, and generation.
Using instance-specific rules as prompts, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator.
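The sketch below illustrates how an instance-specific rule might be injected into a judge prompt and how a binary verdict can be parsed from the judge's reply; the prompt wording, rule text, and parsing convention are assumptions, not AutoEval-Video's actual prompts or evaluation pipeline.
```python
# Sketch of rule-conditioned automatic evaluation (assumed prompt format,
# not AutoEval-Video's actual prompts or pipeline).

def build_judge_prompt(question: str, rule: str, candidate_answer: str) -> str:
    """Compose a judging prompt that embeds an instance-specific evaluation rule."""
    return (
        "You are grading an answer to a video question.\n"
        f"Question: {question}\n"
        f"Grading rule for this instance: {rule}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's free-form reply to a boolean correctness label."""
    reply = judge_reply.upper()
    return "CORRECT" in reply and "INCORRECT" not in reply

prompt = build_judge_prompt(
    question="What does the cyclist do after the light turns green?",
    rule="Accept any answer that mentions the cyclist moving forward or crossing.",
    candidate_answer="The cyclist starts pedalling and crosses the junction.",
)
# `prompt` would be sent to a judge model (e.g. GPT-4) via whatever chat API is
# available; parse_verdict() converts its one-word reply into True/False.
```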
arXiv Detail & Related papers (2023-11-25T02:46:12Z) - VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use [49.574651930395305]
VisIT-Bench is a benchmark for evaluation of instruction-following vision-language models.
Our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption.
We quantify quality gaps between models and references using both human and automatic evaluations.
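One simple way to quantify such a gap is the fraction of test queries on which a judge prefers the model's output over the human-verified reference; the sketch below computes that win rate from hypothetical pairwise preference labels and is not VisIT-Bench's actual evaluation code.
```python
# Hypothetical pairwise preferences: True means a judge preferred the model's
# output over the human-verified reference for that test query.
preferences = [True, False, False, True, False, False, False, True, False, False]

win_rate = sum(preferences) / len(preferences)
print(f"Model preferred over reference on {win_rate:.0%} of queries")
# A win rate well below 50% indicates a quality gap relative to the references.
```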
arXiv Detail & Related papers (2023-08-12T15:27:51Z) - How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources [117.6496550359768]
This work explores recent advances in instruction-tuning language models on a range of open instruction-following datasets.
We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets.
We evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities.
arXiv Detail & Related papers (2023-06-07T19:59:23Z) - Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited fine-tuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z) - Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
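The abstract does not give the loss itself; below is a minimal sketch of the consistency idea, assuming a gradient-based saliency map and a binary region mask. The tensor shapes and the exact loss form are assumptions, not the paper's formulation.
```python
# Sketch of an attention-mask consistency style loss (assumed formulation:
# push gradient-based saliency to concentrate inside the annotated region).
import torch

def mask_consistency_loss(saliency: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """saliency: non-negative (H, W) map, e.g. from GradCAM; region_mask: binary (H, W)."""
    total = saliency.sum() + 1e-8
    inside = (saliency * region_mask).sum()
    # Loss is the fraction of saliency mass falling outside the annotated region.
    return 1.0 - inside / total

saliency = torch.rand(7, 7)        # toy saliency map
mask = torch.zeros(7, 7)
mask[2:5, 2:5] = 1.0               # toy annotated region
print(mask_consistency_loss(saliency, mask))
```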
arXiv Detail & Related papers (2022-06-30T17:55:12Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
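The overstability finding suggests a simple check: append clearly off-topic text amounting to a sizable fraction of the essay and verify whether the predicted score moves. The sketch below does this with a stand-in `score_essay` function; the scorer, the filler text, and the 25% fraction are placeholders, not the toolkit's actual adversarial scheme.
```python
# Sketch of an overstability probe for an essay scorer. `score_essay` is a
# placeholder for a real AES model; here it is a dummy length-based scorer.

def score_essay(text: str) -> float:
    """Stand-in scorer; replace with a real AES model's predict function."""
    return min(10.0, len(text.split()) / 50)

def overstability_probe(essay: str, off_topic_sentence: str, fraction: float = 0.25) -> float:
    """Append off-topic content equal to `fraction` of the essay's word count
    and return the resulting change in predicted score."""
    n_filler_words = int(len(essay.split()) * fraction)
    filler_words = (off_topic_sentence.split() * n_filler_words)[:n_filler_words]
    modified = essay + " " + " ".join(filler_words)
    return score_essay(modified) - score_essay(essay)

essay = "word " * 400  # toy 400-word essay
delta = overstability_probe(essay, "Bananas are yellow and grow in bunches.")
print(f"Score change after adding 25% off-topic content: {delta:+.2f}")
# A robust scorer should penalize the off-topic padding; the paper reports that
# current AES models often barely change their scores.
```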
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.