KNVQA: A Benchmark for evaluation knowledge-based VQA
- URL: http://arxiv.org/abs/2311.12639v2
- Date: Thu, 13 Jun 2024 09:12:48 GMT
- Title: KNVQA: A Benchmark for evaluation knowledge-based VQA
- Authors: Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan
- Abstract summary: Large vision-language models (LVLMs) have made significant progress thanks to their strong perception and reasoning capabilities across visual and language modalities.
However, LVLMs are still plagued by two critical issues, object hallucination and factual inaccuracy, which limit their practicality in different scenarios.
We propose KNVQA-Eval, a novel evaluation method devoted to knowledge-based VQA that reflects the factuality of multimodal LVLMs.
- Score: 8.602776661652083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within the multimodal field, large vision-language models (LVLMs) have made significant progress thanks to their strong perception and reasoning capabilities across visual and language modalities. However, LVLMs are still plagued by two critical issues, object hallucination and factual inaccuracy, which limit their practicality in different scenarios. Furthermore, previous evaluation methods focus more on the comprehension and reasoning of language content and lack a comprehensive evaluation of multimodal interactions, which leads to potential limitations. To this end, we propose KNVQA-Eval, a novel evaluation method devoted to knowledge-based VQA that reflects the factuality of multimodal LVLMs. To ensure the robustness and scalability of the evaluation, we develop a new KNVQA dataset that incorporates human judgment and perception, aiming to evaluate the accuracy of standard answers relative to AI-generated answers in knowledge-based VQA. This work not only comprehensively evaluates the contextual information of LVLMs using reliable human annotations, but also analyzes the fine-grained capabilities of current methods to reveal avenues for subsequent optimization of LVLM-based estimators. Our proposed KNVQA-Eval and the corresponding KNVQA dataset will facilitate the development of automatic evaluation tools with the advantages of low cost, privacy protection, and reproducibility. Our code will be released upon publication.
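Since the paper states that its code will only be released upon publication, the evaluation protocol described above can only be illustrated schematically. The following is a minimal sketch of comparing LVLM answers against human-annotated references; the record fields and the normalized exact-match rule are assumptions for illustration, not the actual KNVQA-Eval scoring or its human-judgment integration.

```python
# Minimal sketch of a knowledge-based VQA accuracy check.
# NOTE: field names and the matching rule are illustrative assumptions;
# this is not the released KNVQA-Eval implementation.
import re
from typing import Iterable


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace before comparison."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)
    return re.sub(r"\s+", " ", answer)


def exact_match_accuracy(records: Iterable[dict]) -> float:
    """Fraction of model answers matching the human reference after normalization."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        normalize(r["model_answer"]) == normalize(r["reference_answer"])
        for r in records
    )
    return hits / len(records)


if __name__ == "__main__":
    demo = [
        {"question": "Which country does this landmark belong to?",
         "reference_answer": "France", "model_answer": "france"},
        {"question": "What year was this painting created?",
         "reference_answer": "1889", "model_answer": "1887"},
    ]
    print(f"exact-match accuracy: {exact_match_accuracy(demo):.2f}")  # 0.50
```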
Related papers
- Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [94.13848736705575]
We introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms.
We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels.
Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Beyond the Hype: A dispassionate look at vision-language models in medical scenario [3.4299097748670255]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks.
Their performance and reliability in specialized domains such as medicine remain insufficiently assessed.
We introduce RadVUQA, a novel benchmark to comprehensively evaluate existing LVLMs.
arXiv Detail & Related papers (2024-08-16T12:32:44Z) - Towards Flexible Evaluation for Generative Visual Question Answering [17.271448204525612]
This paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on Visual Question Answering (VQA) datasets.
In addition, this paper proposes a Semantically Flexible VQA Evaluator (SFVE) whose design is carefully tailored to the unique features of VQA evaluation.
arXiv Detail & Related papers (2024-08-01T05:56:34Z) - LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z) - Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision [85.6008224440157]
Multi-modality Large Language Models (MLLMs) have catalyzed a shift in computer vision from specialized models to general-purpose foundation models.
We present Q-Bench, a holistic benchmark crafted to evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment.
arXiv Detail & Related papers (2023-09-25T14:43:43Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, an instruction-tuned LVLM with massive in-domain data, such as InstructBLIP, heavily overfits many existing tasks and generalizes poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z) - Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment [54.31355080688127]
We introduce a text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP); a minimal sketch of the general prompt-affinity idea appears after this list.
BVQI-Local demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets.
We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
arXiv Detail & Related papers (2023-04-28T08:06:05Z)
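The text-prompted affinity idea in the last entry can be approximated with off-the-shelf CLIP weights. The sketch below scores a single image against an antonym pair of quality prompts using the Hugging Face transformers CLIP API; the openai/clip-vit-base-patch32 checkpoint, the prompts, the quality_score function, and the frame.jpg path are all illustrative choices, not the authors' SAQI/BVQI implementation.

```python
# Generic sketch of text-prompted quality affinity with CLIP.
# NOTE: checkpoint, prompts, and scoring rule are illustrative assumptions,
# not the SAQI/BVQI method from the cited paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CKPT = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(CKPT)
processor = CLIPProcessor.from_pretrained(CKPT)
model.eval()

# Antonym prompt pair: the softmax weight on the "good" prompt serves as a
# zero-shot quality score in [0, 1].
prompts = ["a high quality, sharp photo", "a low quality, blurry photo"]


def quality_score(image: Image.Image) -> float:
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2): image-text affinities
    return logits.softmax(dim=-1)[0, 0].item()


if __name__ == "__main__":
    frame = Image.open("frame.jpg")  # hypothetical video frame path
    print(f"zero-shot quality score: {quality_score(frame):.3f}")
```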