MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
- URL: http://arxiv.org/abs/2510.16641v1
- Date: Sat, 18 Oct 2025 21:00:12 GMT
- Title: MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
- Authors: Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, Ho-Jin Choi
- Abstract summary: MultiVerse is a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. We evaluate 18 Vision-and-Language Models (VLMs) on MultiVerse, revealing that even the strongest models achieve only a 50% success rate in complex multi-turn conversations.
- Score: 25.072791108956682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse offers a comprehensive landscape for evaluating the multi-turn interaction abilities of VLMs.
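To make the checklist-based evaluation described in the abstract concrete, the sketch below shows one plausible way such a GPT-4o judge could be wired up. It is an assumption-laden illustration, not the authors' released code: the checklist items, prompt wording, judge settings, and YES/NO aggregation are hypothetical stand-ins for the paper's 37 aspects, images are omitted for brevity, and it uses the standard OpenAI Python client.

```python
# Minimal sketch of a checklist-style automated judge, loosely following the
# description in the abstract. Checklist items, prompts, and the aggregation
# rule are illustrative assumptions, not the paper's exact protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical checklist items standing in for a few of the 37 aspects
# named in the abstract; the paper's full checklist is not reproduced here.
CHECKLIST = [
    "Perceptual accuracy: does the reply correctly describe the visual content?",
    "Linguistic clarity: is the reply fluent and unambiguous?",
    "Factual correctness: are all stated facts correct?",
]

def judge_turn(dialogue_context: str, model_reply: str) -> float:
    """Ask a GPT-4o judge to answer each checklist item with YES/NO and
    return the fraction of satisfied items for one dialogue turn."""
    passed = 0
    for item in CHECKLIST:
        prompt = (
            "You are evaluating a vision-language model's reply in a "
            "multi-turn conversation.\n\n"
            f"Dialogue so far:\n{dialogue_context}\n\n"
            f"Model reply:\n{model_reply}\n\n"
            f"Checklist item: {item}\n"
            "Answer with a single word: YES or NO."
        )
        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        verdict = result.choices[0].message.content.strip().upper()
        if verdict.startswith("YES"):
            passed += 1
    return passed / len(CHECKLIST)
```

In this sketch a turn's score is simply the fraction of checklist items the judge marks YES; the paper's actual rubric, success criterion, and aggregation across turns may differ.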
Related papers
- Seeing is Not Understanding: A Benchmark on Perception-Cognition Disparities in Large Language Models [9.870930749379932]
EmoBench-Reddit is a novel, hierarchical benchmark for multimodal emotion understanding. The dataset comprises 350 meticulously curated samples from the social media platform Reddit. Each data point features six multiple-choice questions and one open-ended question of increasing difficulty.
arXiv Detail & Related papers (2025-09-14T05:40:24Z) - ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following [0.2999888908665658]
We introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 complex multi-turn dialogue scenarios. We also propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction-following capabilities. Our experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03.
arXiv Detail & Related papers (2025-08-21T02:09:02Z) - Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D [97.08549913899247]
We introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples.
arXiv Detail & Related papers (2025-06-02T03:12:13Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking [0.12369742273401668]
We introduce the PARROT-360V Benchmark, a novel and comprehensive benchmark featuring 2487 challenging visual puzzles.
We evaluate leading models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Pro.
State-of-the-art models scored between 28% and 56% on our benchmark, significantly lower than their performance on popular benchmarks.
arXiv Detail & Related papers (2024-11-20T01:09:21Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - DARE: Diverse Visual Question Answering with Robustness Evaluation [16.87867803628065]
Vision Language Models (VLMs) extend the remarkable capabilities of text-only large language models and vision-only models. However, they struggle with a number of crucial vision-language (VL) reasoning abilities such as counting and spatial reasoning. We introduce DARE, Diverse Visual Question Answering with Robustness Evaluation.
arXiv Detail & Related papers (2024-09-26T16:31:50Z) - MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models [76.1999277491816]
Multimodal Multi-image Understanding (MMIU) is a comprehensive evaluation suite designed to assess Large Vision-Language Models (LVLMs).
MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions.
Our evaluation of 24 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension.
arXiv Detail & Related papers (2024-08-05T17:56:41Z) - M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models [27.18427414844769]
We introduce M4U, a novel benchmark for assessing multi-discipline, multilingual, multimodal understanding and reasoning. M4U contains 10k samples covering 64 disciplines across 16 subfields in Science, Engineering, and Healthcare, in six languages. Using M4U, we conduct extensive evaluations of leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) with external tools.
arXiv Detail & Related papers (2024-05-24T15:25:28Z) - MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [58.33076950775072]
MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues.
We construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
arXiv Detail & Related papers (2024-02-22T18:21:59Z) - MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)