Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained
Evaluation
- URL: http://arxiv.org/abs/2401.06591v1
- Date: Fri, 12 Jan 2024 14:19:23 GMT
- Authors: Seongyun Lee and Seungone Kim and Sue Hyun Park and Geewook Kim and
Minjoon Seo
- Abstract summary: We train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.
Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing long-form responses generated by Vision-Language Models (VLMs) is
challenging. It not only requires checking whether the VLM follows the given
instruction but also verifying whether the text output is properly grounded on
the given image. Inspired by the recent approach of evaluating LMs with LMs, in
this work, we propose to evaluate VLMs with VLMs. For this purpose, we present
a new feedback dataset called the Perception Collection, encompassing 15K
customized score rubrics that users might care about during assessment. Using
the Perception Collection, we train Prometheus-Vision, the first open-source
VLM evaluator model that can understand the user-defined score criteria during
evaluation. Prometheus-Vision shows the highest Pearson correlation with human
evaluators and GPT-4V among open-source models, showing its effectiveness for
transparent and accessible evaluation of VLMs. We open-source our code,
dataset, and model at https://github.com/kaistAI/prometheus-vision
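The headline metric above is Pearson correlation between Prometheus-Vision's scores and those of human evaluators or GPT-4V. As a minimal sketch of how such agreement is computed (the score lists below are made up for illustration, not taken from the paper):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 rubric scores from a VLM judge and a human evaluator
model_scores = [5, 3, 4, 2, 1, 4]
human_scores = [4, 3, 5, 2, 1, 3]
print(round(pearson(model_scores, human_scores), 3))
```

A value near 1 means the VLM judge ranks and scales responses much like the human reference; values near 0 mean no linear agreement.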
Related papers
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision-language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- Review-LLM: Harnessing Large Language Models for Personalized Review Generation [8.898103706804616]
Large Language Models (LLMs) have shown superior text modeling and generating ability.
We propose Review-LLM, which customizes LLMs for personalized review generation.
arXiv Detail & Related papers (2024-07-10T09:22:19Z)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo.
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations [1.709620026135923]
Large language models (LLMs) have become an interesting option for supporting generative tasks related to visualization.
This paper copes with the problem of modeling the evaluation of a generated visualization through an LLM.
We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort into its atomic components.
arXiv Detail & Related papers (2024-02-03T14:28:55Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision-language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [55.91371032213854]
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning.
We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
arXiv Detail & Related papers (2023-11-27T18:59:42Z)
- A Closer Look into Automatic Evaluation Using Large Language Models [75.49360351036773]
We discuss how details in the evaluation process change how well the ratings given by LLMs correlate with human ratings.
We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings.
We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal.
arXiv Detail & Related papers (2023-10-09T12:12:55Z)
- LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub).
Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLMs trained on massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in open-world scenarios.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
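Several of the arena-style benchmarks above (e.g. WildVision-Arena) aggregate pairwise human preferences into Elo ratings. As a minimal sketch of the standard Elo update rule, not any paper's exact implementation (the starting ratings and K-factor below are illustrative assumptions):

```python
def elo_update(r_winner, r_loser, k=32):
    """Apply one Elo update after a pairwise preference vote.

    The winner's expected score comes from the logistic curve over the
    rating gap; both ratings shift by k times the surprise of the outcome.
    """
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    r_winner_new = r_winner + k * (1 - expected_win)
    r_loser_new = r_loser - k * (1 - expected_win)
    return r_winner_new, r_loser_new

# Two hypothetical VLMs start at 1000; one wins a human preference vote
a, b = elo_update(1000, 1000)
print(a, b)  # equal-rated opponents: the winner gains k/2 = 16 points
```

Because the expected score depends on the rating gap, beating a lower-rated opponent moves the ratings less than an upset does, so the ratings converge toward a stable ordering as votes accumulate.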
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.