Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking
- URL: http://arxiv.org/abs/2110.05159v1
- Date: Mon, 11 Oct 2021 11:08:35 GMT
- Title: Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking
- Authors: Dirk Väth, Pascal Tilli and Ngoc Thang Vu
- Abstract summary: We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
- Score: 30.155625852894797
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: On the way towards general Visual Question Answering (VQA) systems that are
able to answer arbitrary questions, the need arises for evaluation beyond
single-metric leaderboards for specific datasets. To this end, we propose a
browser-based benchmarking tool for researchers and challenge organizers, with
an API for easy integration of new models and datasets to keep up with the
fast-changing landscape of VQA. Our tool helps test generalization capabilities
of models across multiple datasets, evaluating not just accuracy, but also
performance in more realistic scenarios, such as robustness to input
noise. Additionally, we include metrics that measure biases and uncertainty, to
further explain model behavior. Interactive filtering facilitates discovery of
problematic behavior, down to the data sample level. As proof of concept, we
perform a case study on four models. We find that state-of-the-art VQA models
are optimized for specific tasks or datasets, but fail to generalize even to
other in-domain test sets; for example, they cannot recognize text in images.
Our metrics allow us to quantify which image and question embeddings provide
the most robustness to a model. All code is publicly available.
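The API for plugging in models and datasets is only mentioned in the abstract, not specified, so the following is a minimal Python sketch of the evaluation idea it describes: score a model on several datasets and report robustness to input noise alongside plain accuracy. All names here (ModelAdapter, add_image_noise, evaluate) are illustrative assumptions, not the tool's actual interface.

```python
# Minimal sketch, assuming an (image, question, answer) sample format;
# not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

import numpy as np

Sample = Tuple[np.ndarray, str, str]  # image in [0, 1], question, ground-truth answer


@dataclass
class ModelAdapter:
    """Hypothetical adapter wrapping any VQA model behind one predict call."""
    name: str
    predict: Callable[[np.ndarray, str], str]  # (image, question) -> answer string


def add_image_noise(image: np.ndarray, std: float = 0.1, seed: int = 0) -> np.ndarray:
    """Gaussian noise on the image, used to probe robustness beyond plain accuracy."""
    rng = np.random.default_rng(seed)
    return np.clip(image + rng.normal(0.0, std, image.shape), 0.0, 1.0)


def evaluate(model: ModelAdapter, dataset: Iterable[Sample]) -> dict:
    """Accuracy on clean inputs and on noise-perturbed images for one dataset."""
    clean_hits = noisy_hits = total = 0
    for image, question, answer in dataset:
        total += 1
        clean_hits += model.predict(image, question).strip().lower() == answer.lower()
        noisy_hits += model.predict(add_image_noise(image), question).strip().lower() == answer.lower()
    return {"model": model.name,
            "accuracy": clean_hits / max(total, 1),
            "noisy_accuracy": noisy_hits / max(total, 1)}


# Usage: report generalization across datasets instead of a single leaderboard number.
# for name, dataset in {"vqa_v2": vqa_v2_samples, "textvqa": textvqa_samples}.items():
#     print(name, evaluate(my_model, dataset))
```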
Related papers
- Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence [3.566250952750758]
We introduce the Dynamic Intelligence Assessment (DIA), a novel methodology for testing AI models.
Our framework introduces four new metrics to assess a model's reliability and confidence across multiple attempts.
The accompanying DIA-Bench dataset is presented in various formats such as text, PDFs, compiled binaries, and visual puzzles.
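The four DIA metrics themselves are not given in this summary, so the snippet below only illustrates the general idea of scoring reliability over repeated attempts; the aggregate scores and their names are assumptions, not the paper's definitions.

```python
# Illustrative multi-attempt scores (assumed, not the DIA paper's four metrics).
def attempt_metrics(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """outcomes maps task id -> correctness of each repeated attempt on that task."""
    n = len(outcomes)
    return {
        # average accuracy over the attempts of each task
        "mean_accuracy": sum(sum(a) / len(a) for a in outcomes.values()) / n,
        # strict reliability: a task counts only if every attempt was correct
        "solved_every_attempt": sum(all(a) for a in outcomes.values()) / n,
        # best-case ability: at least one attempt was correct
        "solved_at_least_once": sum(any(a) for a in outcomes.values()) / n,
    }


print(attempt_metrics({"t1": [True, True, True], "t2": [True, False, True]}))
# -> mean_accuracy ≈ 0.83, solved_every_attempt 0.5, solved_at_least_once 1.0
```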
arXiv Detail & Related papers (2024-10-20T20:07:36Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
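The perturbations are only named above, not specified, so here is a hedged sketch of what perturbing either the image or the question of a sample could look like; perturb_sample, the patch-masking and word-dropping choices, and all parameters are illustrative assumptions, not the UNK-VQA procedure.

```python
# Hedged sketch: perturb one modality of a VQA sample so it may become unanswerable.
import numpy as np


def perturb_sample(image: np.ndarray, question: str, rng: np.random.Generator):
    """Perturb either the image (mask a random patch) or the question (drop words)."""
    if rng.random() < 0.5:
        # Image side: black out a random patch (assumes an H x W x C float array).
        h, w = image.shape[0], image.shape[1]
        y = int(rng.integers(0, max(h // 2, 1)))
        x = int(rng.integers(0, max(w // 2, 1)))
        image = image.copy()
        image[y:y + h // 4, x:x + w // 4] = 0.0
    else:
        # Question side: randomly drop words, but never return an empty string.
        words = question.split()
        kept = [wd for wd in words if rng.random() > 0.2]
        question = " ".join(kept) if kept else question
    return image, question


rng = np.random.default_rng(0)
img, q = perturb_sample(np.ones((224, 224, 3)), "What color is the cat on the sofa?", rng)
print(q)
```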
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Chain-of-Skills: A Configurable Model for Open-domain Question Answering [79.8644260578301]
The retrieval model is an indispensable component for real-world knowledge-intensive tasks.
Recent work focuses on customized methods, limiting model transferability and scalability.
We propose a modular retriever where individual modules correspond to key skills that can be reused across datasets.
arXiv Detail & Related papers (2023-05-04T20:19:39Z)
- SimVQA: Exploring Simulated Environments for Visual Question Answering [15.030013924109118]
We explore using synthetic computer-generated data to fully control the visual and language space.
We quantify the effect of synthetic data in real-world VQA benchmarks and to what extent it produces results that generalize to real data.
We propose Feature Swapping (F-SWAP) -- where we randomly switch object-level features during training to make a VQA model more domain invariant.
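F-SWAP is only described above as randomly switching object-level features during training, so the following is a rough sketch of that idea, assuming the visual input is a (batch, objects, dim) array of region features; the exact F-SWAP recipe in the paper may differ.

```python
# Rough sketch of feature swapping across a batch (assumed interface, not the exact F-SWAP).
import numpy as np


def feature_swap(batch_feats: np.ndarray, swap_prob: float = 0.2, seed: int = 0) -> np.ndarray:
    """batch_feats: (batch, num_objects, dim) object-level region features."""
    rng = np.random.default_rng(seed)
    out = batch_feats.copy()
    batch, num_objects, _ = batch_feats.shape
    for obj in range(num_objects):
        if rng.random() < swap_prob:
            # Shuffle this object slot's features between examples in the batch,
            # so answers cannot rely on dataset-specific visual statistics.
            out[:, obj, :] = batch_feats[rng.permutation(batch), obj, :]
    return out


# During training the swapped features would replace the originals for that step:
# augmented = feature_swap(region_features)
# logits = vqa_model(augmented, questions)
```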
arXiv Detail & Related papers (2022-03-31T17:44:27Z)
- Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark [14.50261153230204]
We focus on Multimodal Machine Comprehension (M3C), where a model is expected to answer questions based on a given passage (or context).
We identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models.
We propose a systematic framework to address these biases through three Control-Knobs.
arXiv Detail & Related papers (2021-10-22T16:33:57Z)
- Human-Adversarial Visual Question Answering [62.30715496829321]
We benchmark state-of-the-art VQA models against human-adversarial examples.
We find that a wide range of state-of-the-art models perform poorly when evaluated on these examples.
arXiv Detail & Related papers (2021-06-04T06:25:32Z)
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.