Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
- URL: http://arxiv.org/abs/2409.09269v1
- Date: Sat, 14 Sep 2024 02:29:36 GMT
- Title: Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
- Authors: Neelabh Sinha, Vinija Jain, Aman Chadha
- Abstract summary: Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience.
We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types.
We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments.
- Score: 0.9217021281095907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question-Answering (VQA) has become a key use-case in several applications to aid user experience, particularly since Vision-Language Models (VLMs) began achieving good results in zero-shot inference. However, evaluating different VLMs against an application's requirements with a standardized framework in practical settings remains challenging. This paper introduces a comprehensive framework for evaluating VLMs tailored to VQA tasks in practical settings. We present a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, three key practical aspects on which tasks can vary. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with ten state-of-the-art VLMs reveal that no single model excels universally, making appropriate selection a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, though open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths in specific contexts while providing additional advantages. This study guides the selection of VLMs based on specific task requirements and resource constraints, and can also be extended to other vision-language tasks.
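The abstract does not reproduce GoEval's prompt or scoring scheme, so the following is a minimal, hypothetical sketch of how a GPT-4o-based VQA judge and its agreement with human labels might be assembled, assuming the standard OpenAI chat-completions API, an illustrative 1-5 rating prompt, and Kendall's tau as one possible correlation measure (the paper's 56.71% figure may be computed differently).

```python
# Hypothetical sketch of a GoEval-style judge for VQA answers.
# The prompt wording, the 1-5 rating scale, and the choice of Kendall's tau
# are illustrative assumptions, not the configuration used by GoEval.
from openai import OpenAI
from scipy.stats import kendalltau

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are grading a visual question-answering system.\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Looking at the attached image, reply with a single integer from "
    "1 (completely wrong) to 5 (fully correct)."
)

def judge_vqa_answer(image_url: str, question: str, answer: str) -> int:
    """Ask GPT-4o to rate one (image, question, candidate answer) triple."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": JUDGE_PROMPT.format(question=question, answer=answer)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    # Assumes the model follows the single-integer output format.
    return int(response.choices[0].message.content.strip())

def correlation_with_humans(judge_scores, human_scores) -> float:
    """One plausible way to measure agreement with human judgments."""
    tau, _ = kendalltau(judge_scores, human_scores)
    return tau
```

In practice, per-example scores of this kind would be aggregated by task type, application domain, and knowledge type, so that candidate VLMs can be compared along the same practical axes the dataset is annotated with.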
Related papers
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models.
Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models.
We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z)
- VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems.
Current assessment methods rely on AI-annotated preference labels from traditional tasks.
We introduce VL-RewardBench, a benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z)
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.
Through an extensive evaluation of nine popular LVLMs across five user-demanded evaluation capabilities, the framework shows its effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- DARE: Diverse Visual Question Answering with Robustness Evaluation [16.87867803628065]
Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models.
However, they struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning.
We introduce DARE, Diverse Visual Question Answering with Robustness Evaluation.
arXiv Detail & Related papers (2024-09-26T16:31:50Z)
- @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology [31.779074930032184]
Human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneously.
We first create a novel AT benchmark (@Bench) guided by a pre-design user study with PVIs.
In addition, we propose a novel AT model (@Model) that addresses all tasks simultaneously and can be expanded to more assistive functions for helping PVIs.
arXiv Detail & Related papers (2024-09-21T18:30:17Z)
- What is the best model? Application-driven Evaluation for Large Language Models [7.054112690519648]
A-Eval is an application-driven evaluation benchmark for general large language models.
We construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing.
We reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model.
arXiv Detail & Related papers (2024-06-14T04:52:15Z)
- Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models [55.5610165938949]
Fine-tuning vision-language models (VLMs) has gained increasing popularity due to its practical value.
This paper explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model.
We introduce three customized ensemble strategies, each tailored to one specific scenario.
The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-28T05:17:25Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization [27.437077941786768]
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks.
We evaluate two pretrained V&L models under different settings by conducting cross-dataset evaluations.
We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task.
arXiv Detail & Related papers (2022-05-24T16:44:45Z)
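As a companion to the cross-dataset finding in the entry above, here is a minimal, hypothetical sketch of an out-of-distribution check: score the same VQA model on its in-domain test split and on a different benchmark, then compare the two accuracies. The predict callable, the example fields, and the exact-match scoring are assumptions for illustration, not the protocol used in that paper.

```python
# Hypothetical sketch of a cross-dataset (out-of-distribution) VQA check.
# The predict() callable, example fields, and exact-match scoring are
# illustrative assumptions, not the evaluation protocol of the paper above.
from typing import Callable, Dict, Sequence

def exact_match_accuracy(predict: Callable[[str, str], str],
                         examples: Sequence[Dict[str, str]]) -> float:
    """Fraction of examples whose predicted answer matches the gold answer."""
    correct = sum(
        predict(ex["image_path"], ex["question"]).strip().lower()
        == ex["answer"].strip().lower()
        for ex in examples
    )
    return correct / len(examples)

def cross_dataset_gap(predict: Callable[[str, str], str],
                      in_domain: Sequence[Dict[str, str]],
                      out_of_domain: Sequence[Dict[str, str]]) -> float:
    """Accuracy drop when moving from the benchmark a model was tuned on
    to a different VQA benchmark; a large gap hints that the model fits
    benchmark-specific patterns rather than general VQA skills."""
    return (exact_match_accuracy(predict, in_domain)
            - exact_match_accuracy(predict, out_of_domain))
```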