Exploring the Reasoning Abilities of Multimodal Large Language Models
(MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
- URL: http://arxiv.org/abs/2401.06805v2
- Date: Thu, 18 Jan 2024 07:31:47 GMT
- Title: Exploring the Reasoning Abilities of Multimodal Large Language Models
(MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning
- Authors: Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao,
Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang
- Abstract summary: We review the existing evaluation protocols of multimodal reasoning, and we categorize and illustrate the frontiers of Multimodal Large Language Models (MLLMs). We introduce recent trends in applications of MLLMs on reasoning-intensive tasks and discuss current practices and future directions.
- Score: 44.12214030785711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence
(AGI) with abstract reasoning ability is the goal of next-generation AI. Recent
advancements in Large Language Models (LLMs), along with the emerging field of
Multimodal Large Language Models (MLLMs), have demonstrated impressive
capabilities across a wide range of multimodal tasks and applications.
In particular, various MLLMs, each with distinct model architectures, training
data, and training stages, have been evaluated across a broad range of MLLM
benchmarks. These studies have, to varying degrees, revealed different aspects
of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs
have not been systematically investigated. In this survey, we comprehensively
review the existing evaluation protocols of multimodal reasoning, categorize
and illustrate the frontiers of MLLMs, introduce recent trends in applications
of MLLMs on reasoning-intensive tasks, and finally discuss current practices
and future directions. We believe our survey establishes a solid foundation and
sheds light on this important topic of multimodal reasoning.
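To make the kind of evaluation protocol the survey reviews concrete, below is a minimal, hypothetical Python sketch of exact-match scoring over image-question-answer triples, one common protocol in multimodal reasoning benchmarks. `query_mllm`, `ReasoningExample`, and the toy data are assumptions introduced here for illustration, not the survey's method or any particular benchmark's API.

```python
# Minimal sketch of a multimodal reasoning evaluation loop, for illustration only.
# `query_mllm` is a hypothetical stand-in for a real MLLM call; actual benchmarks
# and model APIs differ, and the survey does not prescribe this implementation.
from dataclasses import dataclass


@dataclass
class ReasoningExample:
    image_path: str  # path to the visual input
    question: str    # reasoning question about the image
    answer: str      # gold answer used for exact-match scoring


def query_mllm(image_path: str, question: str) -> str:
    """Hypothetical placeholder: swap in a call to an actual MLLM."""
    return "placeholder answer"


def exact_match_accuracy(examples: list[ReasoningExample]) -> float:
    """Score predictions by case-insensitive exact match against gold answers."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        prediction = query_mllm(ex.image_path, ex.question)
        if prediction.strip().lower() == ex.answer.strip().lower():
            correct += 1
    return correct / len(examples)


if __name__ == "__main__":
    # Toy dataset; real benchmarks provide thousands of such triples.
    dataset = [
        ReasoningExample("chart.png", "Which bar is tallest?", "2021"),
        ReasoningExample("scene.png", "How many cats are visible?", "3"),
    ]
    print(f"Exact-match accuracy: {exact_match_accuracy(dataset):.2f}")
```

Exact match is only one of several protocols discussed in the survey; multiple-choice accuracy and LLM-as-judge scoring follow the same harness shape with a different comparison step.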
Related papers
- A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z)
- Surveying the MLLM Landscape: A Meta-Review of Current Surveys [17.372501468675303]
Multimodal Large Language Models (MLLMs) have become a transformative force in the field of artificial intelligence.
This survey aims to provide a systematic review of benchmark tests and evaluation methods for MLLMs.
arXiv Detail & Related papers (2024-09-17T14:35:38Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We also examine tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding, and reasoning.
However, their large model sizes and high training and inference costs have hindered the widespread adoption of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
arXiv Detail & Related papers (2024-05-17T12:37:10Z)
- On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z)
- A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new and rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.