WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
- URL: http://arxiv.org/abs/2406.11069v1
- Date: Sun, 16 Jun 2024 20:53:25 GMT
- Title: WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
- Authors: Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin
- Abstract summary: We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo.
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
- Score: 122.87483437694706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.
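The abstract reports that WV-Bench's GPT-4-as-judge comparisons reach a Spearman correlation of 0.94 with the WV-Arena Elo. The sketch below is only an illustration of that kind of pipeline, not the authors' released code: it derives an Elo ranking from pairwise preference votes and then measures rank agreement between a benchmark score and the arena Elo. The vote log, model names, benchmark scores, and K-factor are hypothetical placeholders.

```python
# Minimal sketch: arena-style Elo from pairwise human votes, plus Spearman
# correlation between a benchmark's scores and the arena Elo ranking.
from collections import defaultdict
from scipy.stats import spearmanr

def compute_elo(battles, k=32, base_rating=1000.0):
    """Sequential Elo update over (model_a, model_b, winner) records,
    where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: base_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Hypothetical vote log and benchmark scores, for illustration only.
battles = [
    ("gpt-4v", "reka-flash", "a"),
    ("gpt-4v", "yi-vl-plus", "a"),
    ("reka-flash", "yi-vl-plus", "tie"),
]
elo = compute_elo(battles)

benchmark_scores = {"gpt-4v": 0.71, "reka-flash": 0.63, "yi-vl-plus": 0.60}
models = sorted(elo)
rho, _ = spearmanr([elo[m] for m in models],
                   [benchmark_scores[m] for m in models])
print(f"Spearman correlation with arena Elo: {rho:.2f}")
```

In practice the K-factor, tie handling, and vote ordering all affect the resulting ratings; arena-style leaderboards typically use more robust estimators, but the rank-correlation step against a candidate benchmark is the same idea.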
Related papers
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models [82.92771279118888]
We introduce Multimodal RewardBench, an expert-annotated benchmark for evaluating multimodal reward models.
Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various vision-language models.
We find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy.
arXiv Detail & Related papers (2025-02-20T01:48:13Z) - Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey [6.73328736679641]
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology at the intersection of computer vision and natural language processing.
VLMs demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification.
arXiv Detail & Related papers (2025-01-04T04:59:33Z) - Do Multimodal Large Language Models See Like Humans? [50.938168841711445]
Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models.
Do MLLMs perceive visual information similarly to humans?
We introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.
arXiv Detail & Related papers (2024-12-12T18:59:25Z) - VisionArena: 230K Real World User-VLM Conversations with Preference Labels [68.11192349083832]
VisionArena is a dataset of 230K real-world conversations between users and vision-language models (VLMs).
Our dataset spans 73K unique users, 45 VLMs, and 138 languages.
We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks.
arXiv Detail & Related papers (2024-12-11T18:59:46Z) - Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [95.63899307791665]
In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension.
arXiv Detail & Related papers (2024-12-04T20:35:07Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z) - TopViewRS: Vision-Language Models as Top-View Spatial Reasoners [38.406430696146714]
The top-view perspective is a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either a realistic or a semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
arXiv Detail & Related papers (2024-06-04T17:55:43Z) - Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation [31.062433484245684]
We train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.
Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models.
arXiv Detail & Related papers (2024-01-12T14:19:23Z)