MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
- URL: http://arxiv.org/abs/2502.11903v2
- Date: Sat, 08 Mar 2025 12:36:34 GMT
- Title: MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
- Authors: Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao,
- Abstract summary: This paper introduces MMRC, a benchmark for evaluating six core open-ended abilities of MLLMs.<n> Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions.<n>We propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses.
- Score: 52.35744453954844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
Related papers
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT)
GCoT is keen to helping MLLMs to recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis.
To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z) - Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning [32.95008932216176]
We introduce MMDiag, a multi-turn multimodal dialogue dataset.
We also present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities.
arXiv Detail & Related papers (2025-03-10T07:32:53Z) - HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios [31.43638572775755]
HammerBench is a novel framework for assessing mobile assistant function-calling capabilities in real-world, multi-turn dialogues.<n>Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios.
arXiv Detail & Related papers (2024-12-21T07:33:55Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions [58.57255822646756]
This paper introduces MathChat, a benchmark designed to evaluate large language models (LLMs) across a broader spectrum of mathematical tasks.
We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single turn question answering, they significantly underperform in more complex scenarios.
We develop MathChat sync, a synthetic dialogue based math dataset for LLM finetuning, focusing on improving models' interaction and instruction following capabilities in conversations.
arXiv Detail & Related papers (2024-05-29T18:45:55Z) - MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms [25.73585435351771]
This paper introduces MM-Soc, a benchmark designed to evaluate Multimodal Large Language Models' understanding of social media content.
MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset.
Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks.
arXiv Detail & Related papers (2024-02-21T22:27:40Z) - Temporal Insight Enhancement: Mitigating Temporal Hallucination in
Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs.
We propose a unique mechanism that decomposes on-demand event queries into iconic actions.
We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
arXiv Detail & Related papers (2024-01-18T10:18:48Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.