Improving Automatic VQA Evaluation Using Large Language Models
- URL: http://arxiv.org/abs/2310.02567v2
- Date: Wed, 10 Jan 2024 17:00:05 GMT
- Title: Improving Automatic VQA Evaluation Using Large Language Models
- Authors: Oscar Ma\~nas, Benno Krojer, Aishwarya Agrawal
- Abstract summary: We propose to leverage the in-context learning capabilities of instruction-tuned large language models to build a better VQA metric.
We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks.
- Score: 6.468405905503242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 8 years after the visual question answering (VQA) task was proposed, accuracy
remains the primary metric for automatic evaluation. VQA Accuracy has been
effective so far in the IID evaluation setting. However, our community is
undergoing a shift towards open-ended generative models and OOD evaluation. In
this new paradigm, the existing VQA Accuracy metric is overly stringent and
underestimates the performance of VQA systems. Thus, there is a need to develop
more robust automatic VQA metrics that serve as a proxy for human judgment. In
this work, we propose to leverage the in-context learning capabilities of
instruction-tuned large language models (LLMs) to build a better VQA metric. We
formulate VQA evaluation as an answer-rating task where the LLM is instructed
to score the accuracy of a candidate answer given a set of reference answers.
We demonstrate the proposed metric better correlates with human judgment
compared to existing metrics across several VQA models and benchmarks. We hope
wide adoption of our metric will contribute to better estimating the research
progress on the VQA task. We plan to release the evaluation code and collected
human judgments.
Related papers
- Towards Flexible Evaluation for Generative Visual Question Answering [17.271448204525612]
This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets.
In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation.
arXiv Detail & Related papers (2024-08-01T05:56:34Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - Evaluating Open-QA Evaluation [29.43815593419996]
This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs)
We introduce a new task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA.
arXiv Detail & Related papers (2023-05-21T10:40:55Z) - QAScore -- An Unsupervised Unreferenced Metric for the Question
Generation Evaluation [6.697751970080859]
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers.
We propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore.
arXiv Detail & Related papers (2022-10-09T19:00:39Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA
Models [45.777326168922635]
We introduce Adversarial VQA, a new large-scale VQA benchmark, collected iteratively via an adversarial human-and-model-in-the-loop procedure.
We find that non-expert annotators can successfully attack SOTA VQA models with relative ease.
Both large-scale pre-trained models and adversarial training methods can only achieve far lower performance than what they can achieve on the standard VQA v2 dataset.
arXiv Detail & Related papers (2021-06-01T05:54:41Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z) - Contrast and Classify: Training Robust VQA Models [60.80627814762071]
We propose a novel training paradigm (ConClaT) that optimize both cross-entropy and contrastive losses.
We find that optimizing both losses -- either alternately or jointly -- is key to effective training.
arXiv Detail & Related papers (2020-10-13T00:23:59Z) - KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA-metric is a new metric for evaluating correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
arXiv Detail & Related papers (2020-05-01T03:24:36Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.