M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
- URL: http://arxiv.org/abs/2405.16473v1
- Date: Sun, 26 May 2024 07:56:30 GMT
- Title: M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
- Authors: Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che
- Abstract summary: Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning.
Current MCoT benchmarks still face several challenges: (1) absence of visual-modal reasoning, (2) only single-step visual-modal reasoning, and (3) missing domains.
We introduce a novel benchmark (M$^3$CoT) to address the above challenges, advancing multi-domain, multi-step, and multi-modal CoT.
- Score: 50.576016777061724
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and has gained increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual-modal reasoning, (2) only single-step visual-modal reasoning, and (3) missing domains, thereby hindering the development of MCoT. Motivated by this, we introduce a novel benchmark (M$^3$CoT) to address the above challenges, advancing multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation of abundant MCoT approaches on Vision Large Language Models (VLLMs). We further highlight that current VLLMs still struggle to reason correctly in M$^3$CoT, and that a large gap remains between existing VLLMs and human performance on M$^3$CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M$^3$CoT can serve as a valuable resource, providing a pioneering foundation for multi-domain, multi-step, multi-modal chain-of-thought research.
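The abstract does not spell out the evaluation protocol in executable form, but the setup it implies (prompting a VLLM to reason step by step over an image plus a multi-choice question, then scoring the extracted option letter) can be sketched. The snippet below is a minimal illustration in Python, assuming a hypothetical `generate(image_path, prompt)` wrapper around whatever VLLM is under test; the dataset fields are placeholders rather than the released M$^3$CoT schema.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MCoTExample:
    """One multi-modal multi-choice question (placeholder fields, not the official schema)."""
    image_path: str
    question: str
    options: List[str]
    answer_index: int

LETTERS = "ABCDEFGH"

def build_cot_prompt(example: MCoTExample) -> str:
    """Compose a chain-of-thought prompt that asks the model to reason step by step
    over both the image and the text before committing to one lettered option."""
    option_block = "\n".join(f"({LETTERS[i]}) {opt}" for i, opt in enumerate(example.options))
    return (
        f"Question: {example.question}\n"
        f"Options:\n{option_block}\n"
        "Let's think step by step, using both the image and the text, "
        "then answer with a single option letter."
    )

def extract_choice(generation: str) -> str:
    """Pull the final option letter out of the model's free-form reasoning."""
    matches = re.findall(r"\(([A-H])\)|\b([A-H])\b", generation)
    flattened = [a or b for a, b in matches]
    # The last letter mentioned usually follows the reasoning chain.
    return flattened[-1] if flattened else ""

def evaluate(examples: List[MCoTExample],
             generate: Callable[[str, str], str]) -> float:
    """Accuracy of a VLLM, wrapped as `generate(image_path, prompt) -> str`, under CoT prompting."""
    correct = 0
    for ex in examples:
        prediction = extract_choice(generate(ex.image_path, build_cot_prompt(ex)))
        correct += int(prediction == LETTERS[ex.answer_index])
    return correct / max(len(examples), 1)
```

Passing different `generate` callables (zero-shot CoT, few-shot CoT, tool-augmented prompting) into the same loop is one way to compare MCoT approaches under an identical scoring rule.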
Related papers
- Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines [63.427721165404634]
This paper investigates the intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG).
This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries.
We construct a benchmark for the M$^2$RAG task, equipped with a suite of text-modal and multi-modal metrics to analyze the capabilities of existing foundation models.
arXiv Detail & Related papers (2024-11-25T13:20:19Z) - M3Hop-CoT: Misogynous Meme Identification with Multimodal Multi-hop Chain-of-Thought [21.06134139986278]
We introduce a Multimodal Multi-hop CoT (M3Hop-CoT) framework for misogynous meme identification.
M3Hop-CoT employs a three-step multimodal prompting principle to induce emotions, target awareness, and contextual knowledge for meme analysis.
We assess the model's generalizability by evaluating it on various benchmark meme datasets.
arXiv Detail & Related papers (2024-10-11T19:50:53Z) - Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z) - MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI [71.53579367538725]
MMT-Bench is a benchmark designed to assess Large Vision-Language Models (LVLMs) across massive multimodal tasks.
MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios.
arXiv Detail & Related papers (2024-04-24T17:37:05Z) - Toward Robust Multimodal Learning using Multimodal Foundational Models [30.755818450393637]
We propose TRML, Toward Robust Multimodal Learning using Multimodal Foundational Models.
TRML employs generated virtual modalities to replace missing modalities.
We also design a semantic matching learning module to align the semantic spaces of generated and missing modalities.
arXiv Detail & Related papers (2024-01-20T04:46:43Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z) - Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation [27.23513712371972]
We propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion.
We also propose M3L: Multi-modal Teacher for Masked Modality Learning.
Our proposal shows an absolute improvement of up to 10% in robust mIoU over the most competitive baselines.
arXiv Detail & Related papers (2023-04-21T05:52:50Z) - Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT, which incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
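The two-stage framework described in the Multimodal-CoT entry above (first generate a rationale from the combined text and image input, then infer the answer conditioned on that rationale) can be sketched as follows. This is a minimal illustration assuming a hypothetical `model(image_path, prompt)` callable, not the authors' released implementation.

```python
from typing import Callable

def multimodal_cot(image_path: str,
                   question: str,
                   model: Callable[[str, str], str]) -> str:
    """Two-stage inference: rationale generation, then answer inference.
    `model(image_path, prompt)` is a placeholder for any vision-language model call."""
    # Stage 1: ask for a rationale grounded in both the image and the question.
    rationale = model(image_path, f"{question}\nSolution: Let's think step by step.")
    # Stage 2: feed the generated rationale back in and ask only for the final answer.
    answer = model(image_path, f"{question}\nRationale: {rationale}\nAnswer:")
    return answer
```

Separating rationale generation from answer inference lets the second stage condition on an explicit reasoning trace rather than forcing the model to reason and answer in a single pass.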