Towards a performance analysis on pre-trained Visual Question Answering
models for autonomous driving
- URL: http://arxiv.org/abs/2307.09329v2
- Date: Fri, 28 Jul 2023 09:50:23 GMT
- Title: Towards a performance analysis on pre-trained Visual Question Answering
models for autonomous driving
- Authors: Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes
- Abstract summary: This paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT.
The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts.
- Score: 2.9552300389898094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This short paper presents a preliminary analysis of three popular Visual
Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the
context of answering questions relating to driving scenarios. The performance
of these models is evaluated by comparing the similarity of responses to
reference answers provided by computer vision experts. Model selection is
predicated on the analysis of transformer utilization in multimodal
architectures. The results indicate that models incorporating cross-modal
attention and late fusion techniques exhibit promising potential for generating
improved answers in a driving context. This initial analysis serves as
a launchpad for a forthcoming comprehensive comparative study involving nine
VQA models and sets the scene for further investigations into the effectiveness
of VQA model queries in self-driving scenarios. Supplementary material is
available at
https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
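The paper does not state which similarity measure scores model responses against the expert references, so the following is only a minimal sketch of one plausible setup: cosine similarity over sentence embeddings, with hypothetical answers for the three models under study.

```python
# A minimal sketch of similarity-based answer evaluation, assuming a
# sentence-embedding metric; the paper's actual similarity measure is
# unspecified, so cosine similarity is an illustrative stand-in.
from sentence_transformers import SentenceTransformer, util

# Hypothetical model responses and an expert reference answer.
model_answers = {
    "ViLBERT": "The traffic light ahead is red.",
    "ViLT": "Red light.",
    "LXMERT": "There is a red traffic light.",
}
reference = "The traffic light in front of the vehicle is red."

encoder = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb = encoder.encode(reference, convert_to_tensor=True)

for name, answer in model_answers.items():
    ans_emb = encoder.encode(answer, convert_to_tensor=True)
    score = util.cos_sim(ans_emb, ref_emb).item()  # cosine similarity in [-1, 1]
    print(f"{name}: {score:.3f}")
```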
Related papers
- Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns [1.3781842574516934]
This study investigates the attention patterns of humans compared to a VQA model when answering driving-related questions.
We propose an approach integrating filters to optimize the model's attention mechanisms, prioritizing relevant objects and improving accuracy.
arXiv Detail & Related papers (2024-06-13T15:00:17Z)
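The summary above does not detail how the proposed filters act on the attention mechanism; the sketch below assumes they behave as a relevance mask over region-level attention logits, which is an illustrative guess rather than the paper's method.

```python
# A minimal sketch of filtering attention toward relevant objects, assuming
# the "filter" is a relevance mask on region-level attention logits.
import torch
import torch.nn.functional as F

def filtered_attention(query, region_feats, relevance):
    """query: (d,), region_feats: (n, d), relevance: (n,) in [0, 1]."""
    logits = region_feats @ query / region_feats.shape[-1] ** 0.5  # (n,)
    # Suppress regions judged irrelevant (e.g., by a driving-object detector).
    logits = logits + torch.log(relevance.clamp_min(1e-6))
    weights = F.softmax(logits, dim=-1)
    return weights @ region_feats  # attended feature, (d,)

q = torch.randn(256)
regions = torch.randn(36, 256)              # 36 detected image regions
rel = torch.tensor([1.0] * 5 + [0.1] * 31)  # hypothetical relevance scores
print(filtered_attention(q, regions, rel).shape)  # torch.Size([256])
```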
- Deciphering AutoML Ensembles: cattleia's Assistance in Decision-Making [0.0]
Cattleia is an application that deciphers the ensembles for regression, multiclass, and binary classification tasks.
It works with models built by three AutoML packages: auto-sklearn, AutoGluon, and FLAML.
arXiv Detail & Related papers (2024-03-19T11:56:21Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
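As a rough illustration of the end-to-end QAG setup, the sketch below runs a seq2seq LM assumed to be fine-tuned to emit question-answer pairs from a context; the checkpoint name and prompt format are placeholders, not the paper's released artifacts.

```python
# A minimal sketch of end-to-end question-answer generation with a seq2seq
# LM; "your-finetuned-qag-model" is a hypothetical checkpoint fine-tuned to
# emit question-answer pairs given a context passage.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "your-finetuned-qag-model"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

context = "The ego vehicle slows down because a pedestrian is crossing."
inputs = tokenizer("generate question and answer: " + context,
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
# The fine-tuned model is assumed to emit text like
# "question: Why does the vehicle slow down? answer: A pedestrian is crossing."
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```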
- Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering [11.754328280233628]
We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance.
We propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations.
arXiv Detail & Related papers (2023-01-25T19:29:19Z)
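A minimal sketch of the multitask idea, assuming the Unified Model is trained with a weighted sum of an answer loss and an explanation loss; the weighting and tensor shapes here are illustrative, not the paper's exact design.

```python
# A minimal sketch of multitask training: one model optimized with a joint
# loss over answers and explanations, L = L_answer + lam * L_explanation.
import torch
import torch.nn.functional as F

def multitask_loss(answer_logits, answer_target,
                   expl_logits, expl_target, lam=0.5):
    """Joint objective over answer classification and explanation tokens."""
    l_ans = F.cross_entropy(answer_logits, answer_target)
    l_expl = F.cross_entropy(expl_logits.flatten(0, 1), expl_target.flatten())
    return l_ans + lam * l_expl

# Toy shapes: batch of 4, 1000 answer classes, 20-token explanations, 30k vocab.
loss = multitask_loss(torch.randn(4, 1000), torch.randint(0, 1000, (4,)),
                      torch.randn(4, 20, 30000), torch.randint(0, 30000, (4, 20)))
print(loss.item())
```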
- Generative Bias for Robust Visual Question Answering [74.42555378660653]
We propose a generative method to train the bias model directly from the target model, called GenB.
In particular, GenB employs a generative network to learn the bias in the target model through a combination of the adversarial objective and knowledge distillation.
We show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE.
arXiv Detail & Related papers (2022-08-01T08:58:02Z)
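The sketch below illustrates only the knowledge-distillation component of GenB, pulling a question-only bias model toward the target model's answer distribution; the adversarial (discriminator) term is omitted, and the answer-vocabulary size is illustrative.

```python
# A minimal sketch of distilling a bias model from the target VQA model's
# answer distribution; GenB's adversarial objective is omitted for brevity.
import torch
import torch.nn.functional as F

def distillation_loss(bias_logits, target_logits, T=1.0):
    """KL divergence pushing the bias model toward the target model."""
    return F.kl_div(F.log_softmax(bias_logits / T, dim=-1),
                    F.softmax(target_logits / T, dim=-1),
                    reduction="batchmean") * T * T

bias_logits = torch.randn(8, 3129, requires_grad=True)  # question-only bias model
target_logits = torch.randn(8, 3129)                    # frozen target VQA model
distillation_loss(bias_logits, target_logits).backward()
print(bias_logits.grad.shape)  # torch.Size([8, 3129])
```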
- MetaQA: Combining Expert Agents for Multi-Skill Question Answering [49.35261724460689]
We argue that despite the promising results of multi-dataset models, some domains or QA formats might require specific architectures.
We propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores.
arXiv Detail & Related papers (2021-12-03T14:05:52Z)
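As a toy stand-in for MetaQA's learned selector, the sketch below combines expert agents by picking the most confident answer prediction; the real architecture also conditions on the question itself, so this max-confidence rule is only a baseline illustration.

```python
# A minimal sketch of combining expert QA agents via their answer
# predictions and confidence scores; MetaQA's selector is learned,
# whereas this uses a simple max-confidence rule.
from dataclasses import dataclass

@dataclass
class AgentPrediction:
    agent: str
    answer: str
    confidence: float  # agent's own answer-prediction confidence

def select_answer(predictions: list[AgentPrediction]) -> AgentPrediction:
    # A learned meta-model would also condition on the question text.
    return max(predictions, key=lambda p: p.confidence)

preds = [
    AgentPrediction("extractive-qa", "a red traffic light", 0.62),
    AgentPrediction("multiple-choice-qa", "traffic light", 0.81),
    AgentPrediction("abstractive-qa", "a light that is red", 0.47),
]
print(select_answer(preds).agent)  # multiple-choice-qa
```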
- Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [59.20766562530209]
VQA models still tend to capture superficial linguistic correlations in the training set.
Recent VQA works introduce an auxiliary question-only model to regularize the training of targeted VQA models.
We propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy.
arXiv Detail & Related papers (2021-10-03T14:31:46Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Counterfactual Samples Synthesizing for Robust Visual Question Answering [104.72828511083519]
We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme.
CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions.
We achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.
arXiv Detail & Related papers (2020-03-14T08:34:31Z)
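A minimal sketch of the question-side synthesis in CSS: masking critical words in a question. Which words count as critical is decided here by a toy keyword set, whereas CSS derives word importance from the model itself.

```python
# A minimal sketch of question-side counterfactual synthesis: masking
# critical words; the keyword set here is a toy stand-in for CSS's
# model-based importance scores.
def mask_critical_words(question, critical_words, mask_token="[MASK]"):
    tokens = question.split()
    return " ".join(mask_token if t.lower().strip("?") in critical_words else t
                    for t in tokens)

question = "What color is the traffic light?"
# Hypothetical critical words for this question.
print(mask_critical_words(question, {"color", "light"}))
# -> "What [MASK] is the traffic [MASK]"
```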
- An LSTM-Based Autonomous Driving Model Using Waymo Open Dataset [7.151393153761375]
This paper introduces an approach to learn a long short-term memory (LSTM)-based model for imitating the behavior of a self-driving model.
The experimental results show that our model outperforms several models in driving action prediction.
arXiv Detail & Related papers (2020-02-14T05:28:15Z)
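A minimal sketch of an LSTM that maps a short history of per-frame features to a driving action; the feature dimension, sequence length, and action set are illustrative assumptions, not the paper's configuration on the Waymo Open Dataset.

```python
# A minimal sketch of LSTM-based driving action prediction from a
# sequence of frame features; all dimensions are illustrative.
import torch
import torch.nn as nn

class DrivingLSTM(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_actions=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)  # e.g., accel/brake/left/right

    def forward(self, frame_feats):          # (batch, time, feat_dim)
        out, _ = self.lstm(frame_feats)
        return self.head(out[:, -1])         # predict action from last step

model = DrivingLSTM()
print(model(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 4])
```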
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between model complexity and performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
arXiv Detail & Related papers (2020-01-20T11:27:21Z)
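To make the fusion-cost trade-off concrete, the sketch below compares parameter counts of two common fusion operators analytically; the feature sizes are illustrative, not taken from the paper.

```python
# A minimal sketch of why multi-modal fusion dominates VQA cost: comparing
# parameter counts of two common fusion operators for 2048-d image and
# 1024-d question features (illustrative sizes).
d_img, d_q, k = 2048, 1024, 1024

# Element-wise (Hadamard) fusion: project both modalities to k, then multiply.
elementwise_params = d_img * k + d_q * k           # two linear projections
# Full bilinear fusion: one d_img x d_q weight matrix per output unit.
bilinear_params = d_img * d_q * k

print(f"element-wise: {elementwise_params:,} params")  # 3,145,728
print(f"bilinear:     {bilinear_params:,} params")     # 2,147,483,648
```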
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.