MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
- URL: http://arxiv.org/abs/2406.11833v1
- Date: Mon, 17 Jun 2024 17:59:47 GMT
- Title: MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
- Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang,
- Abstract summary: We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations.
- Score: 88.28014831467503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to ffnd the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA:+1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
Related papers
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z) - Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models [10.41857522464292]
We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark to assess the long-context capabilities of MLLMs.
We employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval.
We evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models.
arXiv Detail & Related papers (2024-06-17T05:54:06Z) - Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: Visual content is unnecessary for many samples and intentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z) - Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models [48.129934341928355]
We propose a novel reasoning scheme named Rec-GPT4V: Visual-Summary Thought (VST)
We utilize user history as in-context user preferences to address the first challenge.
Next, we prompt LVLMs to generate item image summaries and utilize image comprehension in natural language space combined with item titles to query the user preferences over candidate items.
arXiv Detail & Related papers (2024-02-13T18:51:18Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Model (MLLM) relies on the powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building a LVLM evaluation Hub (LVLM-eHub)
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLM with massive in-domain data such as InstructBLIP heavily overfits many existing tasks, generalizing poorly in the open-world scenario.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.