TouchStone: Evaluating Vision-Language Models by Language Models
- URL: http://arxiv.org/abs/2308.16890v2
- Date: Mon, 4 Sep 2023 15:06:15 GMT
- Title: TouchStone: Evaluating Vision-Language Models by Language Models
- Authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang,
Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou
- Abstract summary: We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large vision-language models (LVLMs) have recently witnessed rapid
advancements, exhibiting a remarkable capacity for perceiving, understanding,
and processing visual information by connecting visual receptors with large
language models (LLMs). However, current assessments mainly focus on
recognizing and reasoning abilities, lacking direct evaluation of
conversational skills and neglecting visual storytelling abilities. In this
paper, we propose an evaluation method that uses strong LLMs as judges to
comprehensively evaluate the various abilities of LVLMs. Firstly, we construct
a comprehensive visual dialogue dataset TouchStone, consisting of open-world
images and questions, covering five major categories of abilities and 27
subtasks. This dataset not only covers fundamental recognition and
comprehension but also extends to literary creation. Secondly, by integrating
detailed image annotations we effectively transform the multimodal input
content into a form understandable by LLMs. This enables us to employ advanced
LLMs for directly evaluating the quality of the multimodal dialogue without
requiring human intervention. Through validation, we demonstrate that powerful
LLMs, such as GPT-4, can effectively score dialogue quality by leveraging
their textual capabilities alone, aligning with human preferences. We hope our
work can serve as a touchstone for LVLMs' evaluation and pave the way for
building stronger LVLMs. The evaluation code is available at
https://github.com/OFA-Sys/TouchStone.
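As a rough illustration of the judging scheme described above, the sketch below (plain Python; the prompt wording, helper names, and 1-10 scale are illustrative assumptions, not the official implementation in the linked repository) replaces the image with its detailed textual annotation so a text-only judge such as GPT-4 can score the answer.

import re

def build_judge_prompt(annotation: str, question: str, answer: str) -> str:
    """Pack the fine-grained image annotation, the question, and an LVLM's
    answer into one text prompt that a text-only judge model can evaluate."""
    return (
        "You are evaluating the quality of a visual dialogue.\n"
        f"[Detailed image annotation]\n{annotation}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Model answer]\n{answer}\n\n"
        "Rate the answer's helpfulness, relevance, and accuracy on a 1-10 "
        "scale. Reply with the score first, then a brief justification."
    )

def parse_score(judge_reply: str) -> float:
    """Pull the leading numeric score out of the judge's free-form reply."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    return float(match.group()) if match else 0.0

# Usage: send build_judge_prompt(...) to a strong text-only LLM through
# whatever API client you use, then call parse_score(...) on its reply.
prompt = build_judge_prompt(
    annotation="A golden retriever leaps to catch a red frisbee on a sunny beach.",
    question="What is the dog doing?",
    answer="The dog is jumping to catch a red frisbee near the shoreline.",
)
print(prompt)
print(parse_score("8 - The answer is accurate and directly addresses the question."))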
Related papers
- OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancies when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can translate lip movements more effectively than models trained on far more labeled data.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- VCoder: Versatile Vision Encoders for Multimodal Large Language Models [46.95488342139727]
Multimodal Large Language Models (MLLMs) have recently achieved impressive performance on vision-language tasks.
However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail.
We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
arXiv Detail & Related papers (2023-12-21T18:49:47Z)
- DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations [28.15933355881604]
Large language models (LLMs) have shown extraordinary efficacy across numerous downstream natural language processing (NLP) tasks.
We propose DialogueLLM, a context- and emotion-knowledge-tuned LLM obtained by fine-tuning LLaMA models.
We offer a comprehensive evaluation of our proposed model on three benchmark datasets for emotion recognition in conversations.
arXiv Detail & Related papers (2023-10-17T16:15:34Z)
- Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are built on top of large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
- TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models [86.85389322710674]
This work presents an early and holistic evaluation of Large Vision-Language Models (LVLMs).
It proposes a lightweight variant of LVLM-eHub, named Tiny LVLM-eHub.
It provides a systematic assessment of six categories of multimodal capabilities, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.
arXiv Detail & Related papers (2023-08-07T17:17:05Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)