Towards a performance analysis on pre-trained Visual Question Answering
models for autonomous driving
- URL: http://arxiv.org/abs/2307.09329v2
- Date: Fri, 28 Jul 2023 09:50:23 GMT
- Authors: Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes
- Abstract summary: This paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT.
The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This short paper presents a preliminary analysis of three popular Visual
Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the
context of answering questions relating to driving scenarios. The performance
of these models is evaluated by comparing the similarity of responses to
reference answers provided by computer vision experts. Model selection is
predicated on the analysis of transformer utilization in multimodal
architectures. The results indicate that models incorporating cross-modal
attention and late fusion techniques exhibit promising potential for generating
improved answers within a driving perspective. This initial analysis serves as
a launchpad for a forthcoming comprehensive comparative study involving nine
VQA models and sets the scene for further investigations into the effectiveness
of VQA model queries in self-driving scenarios. Supplementary material is
available at
https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
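The abstract describes scoring models by the similarity of their responses to expert-written reference answers, but this listing does not specify the similarity metric used. A minimal sketch of such an evaluation, assuming a simple character-level similarity from the Python standard library (the model names are from the paper; the QA pair and metric are illustrative assumptions):

```python
# Hypothetical sketch: rank VQA model answers by similarity to an
# expert reference answer. The actual metric used in the paper is
# not specified here; difflib's ratio() stands in as a placeholder.
from difflib import SequenceMatcher


def answer_similarity(response: str, reference: str) -> float:
    """Return a similarity score in [0, 1] between two answer strings."""
    return SequenceMatcher(None, response.lower(), reference.lower()).ratio()


# Illustrative driving-scene question with made-up model outputs.
reference = "the traffic light is red"
responses = {
    "ViLBERT": "the light is red",
    "ViLT": "a red traffic signal",
    "LXMERT": "red",
}

scores = {name: answer_similarity(ans, reference) for name, ans in responses.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

An identical response scores 1.0; disjoint strings approach 0.0, so the ranking orders models by closeness to the expert reference.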