Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained
Evaluation
- URL: http://arxiv.org/abs/2401.06591v1
- Date: Fri, 12 Jan 2024 14:19:23 GMT
- Title: Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained
Evaluation
- Authors: Seongyun Lee and Seungone Kim and Sue Hyun Park and Geewook Kim and
Minjoon Seo
- Abstract summary: We train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.
Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models.
- Score: 31.062433484245684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing long-form responses generated by Vision-Language Models (VLMs) is
challenging. It not only requires checking whether the VLM follows the given
instruction but also verifying whether the text output is properly grounded on
the given image. Inspired by the recent approach of evaluating LMs with LMs, in
this work, we propose to evaluate VLMs with VLMs. For this purpose, we present
a new feedback dataset called the Perception Collection, encompassing 15K
customized score rubrics that users might care about during assessment. Using
the Perception Collection, we train Prometheus-Vision, the first open-source
VLM evaluator model that can understand the user-defined score criteria during
evaluation. Prometheus-Vision shows the highest Pearson correlation with human
evaluators and GPT-4V among open-source models, showing its effectiveness for
transparent and accessible evaluation of VLMs. We open-source our code,
dataset, and model at https://github.com/kaistAI/prometheus-vision
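Since the paper centers on rubric-conditioned judging and reports agreement via Pearson correlation, a minimal sketch of both steps may help. It assumes a hypothetical prompt template (JUDGE_TEMPLATE), a placeholder judge_score stub, and invented scores; it is not the released Prometheus-Vision interface (see the linked repository for that).

```python
# Minimal sketch (not the released Prometheus-Vision code): build a
# rubric-conditioned judge prompt, then meta-evaluate a judge against
# human ratings with Pearson correlation. Template, stub, and scores
# are illustrative assumptions.
from scipy.stats import pearsonr

JUDGE_TEMPLATE = """###Task Description:
Given an image, an instruction, a response, and a score rubric,
assign an integer score from 1 to 5 and explain the judgment.

###Instruction: {instruction}
###Response to evaluate: {response}
###Score rubric: {rubric}
"""


def build_judge_prompt(instruction: str, response: str, rubric: str) -> str:
    """Fill the (hypothetical) evaluation template for one sample."""
    return JUDGE_TEMPLATE.format(
        instruction=instruction, response=response, rubric=rubric
    )


def judge_score(prompt: str, image_path: str) -> int:
    """Placeholder: run the prompt and image through a VLM judge and parse a 1-5 score."""
    raise NotImplementedError


# Meta-evaluation with toy numbers: how well do judge scores track human scores?
human_scores = [5, 3, 4, 2, 4, 1]
judge_scores = [4, 3, 5, 2, 4, 2]
r, p = pearsonr(human_scores, judge_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```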
Related papers
- Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark [22.128954880120222]
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks.
This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks.
arXiv Detail & Related papers (2025-01-16T17:08:12Z)
- Probing Visual Language Priors in VLMs [51.016683265437536]
We introduce ViLP, a benchmark featuring deliberately out-of-distribution images.
Each question in ViLP is coupled with three potential answers and three corresponding images.
We propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training.
arXiv Detail & Related papers (2024-12-31T17:54:29Z)
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts [33.12056808870413]
We introduce Value-Spectrum, a novel Visual Question Answering (VQA) benchmark aimed at assessing Vision-Language Models (VLMs).
We designed a VLM agent pipeline to simulate video browsing and constructed a vector database comprising over 50,000 short videos from TikTok, YouTube Shorts, and Instagram Reels.
Benchmarking on Value-Spectrum highlights notable variations in how VLMs handle value-oriented content.
arXiv Detail & Related papers (2024-11-18T11:31:10Z)
- AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.
Through an extensive evaluation of nine popular LVLMs across five user-demanded evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo (a minimal rank-correlation sketch follows this entry).
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z)
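As referenced in the WildVision entry above, here is a minimal sketch of rank-correlating a judge-based benchmark with arena-style Elo ratings, the kind of agreement statistic WV-Bench reports. The model names and all numbers are invented; only scipy.stats.spearmanr is assumed.

```python
# Minimal sketch: Spearman rank correlation between a judge-based benchmark
# and arena-style Elo ratings. Model names and numbers are invented.
from scipy.stats import spearmanr

benchmark_scores = {"vlm_a": 71.2, "vlm_b": 64.5, "vlm_c": 58.9, "vlm_d": 49.3}
elo_ratings = {"vlm_a": 1220, "vlm_b": 1185, "vlm_c": 1150, "vlm_d": 1080}

models = sorted(benchmark_scores)  # fixed ordering so the samples stay paired
rho, p = spearmanr(
    [benchmark_scores[m] for m in models],
    [elo_ratings[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```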
- How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs [55.91371032213854]
This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning.
We introduce a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness.
arXiv Detail & Related papers (2023-11-27T18:59:42Z)