MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
- URL: http://arxiv.org/abs/2503.18533v1
- Date: Mon, 24 Mar 2025 10:40:33 GMT
- Title: MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning
- Authors: Dawei Yan, Yang Li, Qing-Guo Chen, Weihua Luo, Peng Wang, Haokui Zhang, Chunhua Shen
- Abstract summary: The Multimodal Multi-turn Contextual Reasoning (MMCR) dataset is the largest multi-image, multi-turn instruction tuning dataset, comprising 310K contextual dialogues. Models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench.
- Score: 59.01443478716538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compared to single-turn dialogue, multi-turn dialogue involving multiple images better aligns with the needs of real-world human-AI interactions. Additionally, as training data, it provides richer contextual reasoning information, thereby guiding the model to achieve better performance. However, existing vision-language models (VLMs) primarily rely on single-turn dialogue training and evaluation benchmarks. In this paper, following the characteristics of human dialogue, such as focused topics and concise, clear content, we present MMCR (Multimodal Multi-turn Contextual Reasoning), a novel dataset comprising: (1) MMCR-310k -- the largest multi-image multi-turn instruction tuning dataset with 310K contextual dialogues, each covering 1-4 images and 4 or 8 dialogue turns; and (2) MMCR-Bench -- a diagnostic benchmark featuring dialogues spanning 8 domains (Humanities, Natural Science, Education, etc.) and 40 sub-topics. Extensive evaluations demonstrate that models fine-tuned with MMCR-310k achieve 5.2% higher contextual accuracy on MMCR-Bench, while showing consistent improvements on existing benchmarks (+1.1% on AI2D, +1.2% on MMMU and MMVet). MMCR and prompt engineering will be released publicly.
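The abstract pins down the shape of an MMCR-310k record: each contextual dialogue covers 1-4 images and 4 or 8 turns, and is categorized into one of 8 domains and 40 sub-topics. The sketch below shows one hypothetical record in Python; the field names, image-placeholder convention, and JSON-lines serialization are assumptions for illustration only, not the released schema.

```python
import json

# Hypothetical layout of one MMCR-310k contextual dialogue record.
# Only the counts (1-4 images, 4 or 8 turns, 8 domains, 40 sub-topics)
# come from the abstract; every field name here is illustrative.
record = {
    "dialogue_id": "mmcr_000001",
    "images": ["scene_a.jpg", "scene_b.jpg"],  # 1-4 images per dialogue
    "turns": [                                 # 4 or 8 turns, alternating roles
        {"role": "user", "content": "<image_0> What process does this diagram show?"},
        {"role": "assistant", "content": "It depicts the water cycle."},
        {"role": "user", "content": "Which stage of it does <image_1> illustrate?"},
        {"role": "assistant", "content": "The evaporation stage, over the ocean."},
    ],
    "domain": "Natural Science",               # one of 8 domains
    "sub_topic": "Earth systems",              # one of 40 sub-topics
}

# Serialize as one JSON-lines entry, a common format for instruction tuning data.
print(json.dumps(record))
```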
Related papers
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark [77.93283927871758]
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark.
MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T15:31:26Z)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets [29.737965533532577]
Multimodal Augmented Generative Images Dialogues (MAGID) is a framework to augment text-only dialogues with diverse and high-quality images.
Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation.
arXiv Detail & Related papers (2024-03-05T18:31:28Z)
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend the capabilities of text-only models by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset [18.449076451976236]
In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset.
In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments.
Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset.
arXiv Detail & Related papers (2022-12-08T07:29:07Z)
- MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation [68.53133207668856]
We introduce the MMDialog dataset to better facilitate multi-modal conversation.
MMDialog is composed of a curated set of 1.08 million real-world dialogues with 1.53 million unique images across 4,184 topics.
To build an engaging dialogue system with this dataset, we propose and normalize two response-producing tasks.
arXiv Detail & Related papers (2022-11-10T17:37:04Z)
- Multimodal Dialogue Response Generation [27.611204319057393]
We present a multimodal dialogue generation model that takes the dialogue history as input and generates either a textual sequence or an image as the response.
We consider multimodal dialogue generation under the natural assumption that only limited training examples are available.
In such a low-resource setting, we devise a novel conversational agent, Divter, which isolates the parameters that depend on multimodal dialogues from the rest of the model.
arXiv Detail & Related papers (2021-10-16T08:52:26Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response in the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for integrating the two models via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)