VisionArena: 230K Real World User-VLM Conversations with Preference Labels
- URL: http://arxiv.org/abs/2412.08687v2
- Date: Fri, 13 Dec 2024 23:12:23 GMT
- Title: VisionArena: 230K Real World User-VLM Conversations with Preference Labels
- Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
- Abstract summary: VisionArena is a dataset of 230K real-world conversations between users and vision-language models (VLMs).
Our dataset spans 73K unique users, 45 VLMs, and 138 languages.
We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks.
- Score: 68.11192349083832
- Abstract: With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena - an open-source platform where users interact with VLMs and submit preference votes - VisionArena spans 73K unique users, 45 VLMs, and 138 languages. Our dataset contains three subsets: VisionArena-Chat, 200K single and multi-turn conversations between a user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark of 500 diverse user prompts that efficiently approximate the live Chatbot Arena model rankings. Additionally, we highlight the types of questions asked by users, the influence of response style on preference, and areas where models often fail. We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks. Lastly, we show that finetuning the same base model on VisionArena-Chat outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai
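As a rough illustration of how pairwise preference votes such as those in VisionArena-Battle can be turned into a model ordering, the sketch below computes a plain per-model win rate. The record fields (`model_a`, `model_b`, `winner`), the model names, and the votes are hypothetical placeholders, not the dataset's actual schema or contents. A simple win rate is used here only for brevity; it is not claimed to be the rating method behind the live Chatbot Arena rankings.

```python
from collections import defaultdict


def win_rates(battles):
    """Return each model's win rate; a tie counts as half a win for both sides."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for battle in battles:
        a, b = battle["model_a"], battle["model_b"]
        games[a] += 1
        games[b] += 1
        if battle["winner"] == "model_a":
            wins[a] += 1.0
        elif battle["winner"] == "model_b":
            wins[b] += 1.0
        else:  # any other label is treated as a tie
            wins[a] += 0.5
            wins[b] += 0.5
    return {model: wins[model] / games[model] for model in games}


# Hypothetical votes, invented purely for illustration.
battles = [
    {"model_a": "vlm-x", "model_b": "vlm-y", "winner": "model_a"},
    {"model_a": "vlm-y", "model_b": "vlm-z", "winner": "tie"},
    {"model_a": "vlm-x", "model_b": "vlm-z", "winner": "model_a"},
    {"model_a": "vlm-z", "model_b": "vlm-y", "winner": "model_b"},
]

rates = win_rates(battles)
# Order models from highest to lowest win rate.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

Win rates ignore opponent strength, so with unevenly sampled pairings a pairwise rating model would usually be a better fit than this sketch.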
Related papers
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which includes a new capability called "image-text sequence understanding".
Using MM-Vet v2 to benchmark large multimodal models, we found that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z)
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
We launch WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate vision-language models (VLMs).
WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo.
Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.
arXiv Detail & Related papers (2024-06-16T20:53:25Z)
- An Introduction to Vision-Language Modeling [128.6223984157515]
Vision-language model (VLM) applications will significantly impact our relationship with technology.
We introduce what VLMs are, how they work, and how to train them.
Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
arXiv Detail & Related papers (2024-05-27T15:01:23Z)
- Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation [31.062433484245684]
We train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.
Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models.
arXiv Detail & Related papers (2024-01-12T14:19:23Z)
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
We introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art large language models (LLMs).
This dataset is collected from 210K IP addresses in the wild on our Vicuna demo and Arena website.
We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
arXiv Detail & Related papers (2023-09-21T12:13:55Z)
- TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)