Related papers: Towards Flexible Evaluation for Generative Visual Question Answering

Towards Flexible Evaluation for Generative Visual Question Answering

URL: http://arxiv.org/abs/2408.00300v1
Date: Thu, 1 Aug 2024 05:56:34 GMT
Title: Towards Flexible Evaluation for Generative Visual Question Answering
Authors: Huishan Ji, Qingyi Si, Zheng Lin, Weiping Wang,
Abstract summary: This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation.
Score: 17.271448204525612
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Throughout rapid development of multimodal large language models, a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a developed test field, limitations of VQA evaluation, like the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discourage rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. As characteristics of VQA have made such evaluation significantly different than the traditional Semantic Textual Similarity (STS) task, to systematically analyze the behaviour and compare the performance of various evaluators including LLM-based ones, we proposes three key properties, i.e., Alignment, Consistency and Generalization, and a corresponding dataset Assessing VQA Evaluators (AVE) to facilitate analysis. In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) with meticulous design based on the unique features of VQA evaluation. Experimental results verify the feasibility of model-based VQA evaluation and effectiveness of the proposed evaluator that surpasses existing semantic evaluators by a large margin. The proposed training scheme generalizes to both the BERT-like encoders and decoder-only LLM.

Related papers

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA) In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. We introduce AutoBench-V, an automated framework for serving evaluation on demand. Through an extensive evaluation of seven popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries. We benchmark the helpfulness-truthfulness trade-offs of a range ofVLMs on PROVE, finding that very few are in-fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z)
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two concepts: Benchmark+, which extends traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process. We propose an agent-based evaluation framework called TestAgent, which implements these concepts through retrieval augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation [41.66053021998106]
Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. We propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms.
arXiv Detail & Related papers (2024-10-13T11:48:09Z)
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit lateral thinking of Large Language Models (LLMs) This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation. Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z)
HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation [20.178644251662316]
We introduce the hierarchical graph of thoughts (HGOT) to enhance the retrieval of pertinent passages during in-context learning. The framework employs the divide-and-conquer strategy to break down complex queries into manageable sub-queries. It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics.
arXiv Detail & Related papers (2024-02-14T18:41:19Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
KNVQA: A Benchmark for evaluation knowledge-based VQA [8.602776661652083]
Large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. LVLMs are still plagued by the two critical issues of object hallucination and factual accuracy, which limit the practicality of LVLMs in different scenarios. We propose a novel KNVQA-Eval, which is devoted to knowledge-based VQA task evaluation to reflect the factuality of multimodal LVLMs.
arXiv Detail & Related papers (2023-11-21T14:39:18Z)
SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation) We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models. Reference-free evaluators are more suitable for open-ended examples with different semantics responses. There are risks in using eference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.