LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models
- URL: http://arxiv.org/abs/2306.09265v1
- Date: Thu, 15 Jun 2023 16:39:24 GMT
- Title: LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models
- Authors: Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei,
Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo
- Abstract summary: This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs trained on massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
- Score: 55.304181390027274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite their great success, the field still
lacks a holistic evaluation of their efficacy. This paper presents a comprehensive
evaluation of publicly available large multimodal models by building an LVLM
evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 8 representative LVLMs
such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a
quantitative capability evaluation and an online arena platform. The former
evaluates 6 categories of multimodal capabilities of LVLMs, such as visual
question answering and embodied artificial intelligence, on 47 standard
text-related visual benchmarks, while the latter provides a user-level
evaluation of LVLMs in an open-world question-answering scenario. The study
reveals several innovative findings. First, instruction-tuned LVLMs trained on
massive in-domain data, such as InstructBLIP, heavily overfit many existing
tasks and generalize poorly in open-world scenarios. Second, instruction-tuned
LVLMs trained on moderate amounts of instruction-following data may suffer from
object hallucination, i.e., they generate objects in their descriptions that
are inconsistent with the target images. This either renders current evaluation
metrics such as CIDEr for image captioning ineffective or leads to wrong
answers. Third, employing a multi-turn reasoning evaluation framework can
mitigate the issue of object hallucination, shedding light on how to develop an
effective pipeline for LVLM evaluation. The findings provide a foundational
framework for the conception and assessment of innovative strategies aimed at
enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at
https://github.com/OpenGVLab/Multi-Modality-Arena
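To make the arena track above more concrete, the sketch below shows how an arena-style, user-level evaluation is commonly aggregated: pairwise user votes are folded into per-model ratings with an Elo-style update, as popularized by Chatbot Arena. Whether LVLM-eHub uses exactly this update rule, and the constants K and BASE below, are assumptions made for illustration rather than details taken from the paper.

```python
from collections import defaultdict

K = 32          # update step size (assumed, not taken from the paper)
BASE = 1000.0   # initial rating for every model (assumed)

def expected(r_a: float, r_b: float) -> float:
    """Expected probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """Apply one user vote: outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] -= K * (outcome - e_a)   # zero-sum counterpart of A's change

# Hypothetical vote log: (model shown as A, model shown as B, user's verdict).
votes = [
    ("InstructBLIP", "MiniGPT-4", 1.0),
    ("MiniGPT-4", "InstructBLIP", 0.5),
    ("InstructBLIP", "MiniGPT-4", 0.0),
]

ratings = defaultdict(lambda: BASE)
for a, b, outcome in votes:
    update(ratings, a, b, outcome)

# Print the resulting leaderboard, highest rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```

The update is zero-sum, so ratings only shift relative to one another, and because each vote moves the ratings immediately, the final ranking is mildly sensitive to vote order; arena leaderboards are therefore usually reported only after a large number of votes has accumulated.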
Related papers
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-demanded evaluation capabilities, the framework demonstrates its effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of instruction-tuned LVLMs (IT-LVLMs) on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- VLM-Eval: A General Evaluation on Video Large Language Models [16.92780012093112]
We introduce a unified evaluation that encompasses multiple video tasks, including captioning, question answering, retrieval, and action recognition.
We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs.
We also evaluate video LLMs beyond academic datasets; they show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs used for fine-tuning.
arXiv Detail & Related papers (2023-11-20T16:02:10Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone; a minimal sketch of such a judge-based scoring setup appears after this list.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
- TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models [86.85389322710674]
This work presents an early and holistic evaluation of Large Vision-Language Models (LVLMs).
It proposes a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub.
It provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.
arXiv Detail & Related papers (2023-08-07T17:17:05Z)
- LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models [13.659853119356507]
Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks.
They are prone to hallucinations, where the model presents incorrect or false information in its responses.
We propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets.
arXiv Detail & Related papers (2023-04-02T05:47:09Z)
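As a concrete illustration of the LLM-as-judge protocol summarized in the TouchStone entry above, here is a minimal sketch: a text-only judge model scores an LVLM's answer given a textual description of the image, the question, and the candidate answer. The prompt wording, the 1-10 rubric, and the use of the OpenAI chat-completion client are assumptions made for this sketch and are not taken from the TouchStone paper.

```python
from openai import OpenAI  # any chat-completion client would work here

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric: the judge sees a detailed image description (so it never
# needs vision itself), the question, and the candidate LVLM answer, and returns
# a single 1-10 score. TouchStone's actual prompt and rubric may differ.
JUDGE_PROMPT = """You are grading the answer of a vision-language assistant.
Image description: {description}
Question: {question}
Assistant answer: {answer}
Rate the answer's correctness and helpfulness on a scale of 1 to 10.
Reply with the number only."""

def judge_score(description: str, question: str, answer: str) -> float:
    """Ask a text-only LLM judge to score one LVLM response."""
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(description=description,
                                                  question=question,
                                                  answer=answer)}],
        temperature=0.0,
    )
    return float(reply.choices[0].message.content.strip())

if __name__ == "__main__":
    score = judge_score(
        description="A brown dog is running across a grassy field.",
        question="What animal is in the picture and what is it doing?",
        answer="A brown dog is running on the grass.",
    )
    print(f"judge score: {score:.1f}")
```

In practice, judge prompts of this kind usually also include a reference answer and ask for a brief justification before the score, and scores are averaged over many judged samples per capability to reduce the judge's own variance.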
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.