MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
- URL: http://arxiv.org/abs/2407.05047v3
- Date: Mon, 7 Oct 2024 14:05:05 GMT
- Title: MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning
- Authors: Min Zhang, Xian Fu, Jianye Hao, Peilong Han, Hao Zhang, Lei Shi, Hongyao Tang, Yan Zheng
- Abstract summary: We provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning.
We propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios.
Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance.
- Score: 50.45558735526665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning, aiming to shed light on their capabilities and limitations in this domain. To this end, based on the characteristics of embodied task planning, we first develop a systematic evaluation framework, which encapsulates four crucial capabilities of MFMs: object understanding, spatio-temporal perception, task understanding, and embodied reasoning. Following this, we propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios, typical yet diverse task types, task instances of varying difficulties, and rich test case types ranging from multiple embodied question answering to embodied task reasoning. Finally, we offer a simple and easy-to-use automatic evaluation platform that enables the automated testing of multiple MFMs on the proposed benchmark. Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance. The MFE-ETP is a high-quality, large-scale, and challenging benchmark relevant to real-world tasks.
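The abstract describes the automatic evaluation platform only at a high level. The following is a minimal sketch, assuming a capability-tagged test-case format and exact-match scoring (both assumptions for illustration, not the released MFE-ETP API), of how automated testing of an MFM across the four evaluated capabilities could be wired up.

```python
# Minimal sketch (not the official MFE-ETP platform): loop over
# capability-tagged test cases, query a model, and report per-capability accuracy.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable

# The four capabilities named in the paper's evaluation framework.
CAPABILITIES = (
    "object_understanding",
    "spatio_temporal_perception",
    "task_understanding",
    "embodied_reasoning",
)

@dataclass
class TestCase:
    capability: str   # one of CAPABILITIES
    image_path: str   # visual observation of the scene (hypothetical field)
    question: str     # embodied QA or task-reasoning prompt
    reference: str    # gold answer used for automatic scoring

def evaluate(model: Callable[[str, str], str], cases: list[TestCase]) -> dict:
    """Query the model on every test case and return per-capability accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for case in cases:
        prediction = model(case.image_path, case.question)
        total[case.capability] += 1
        # Exact-match scoring is a stand-in; the real platform may use a
        # more tolerant or model-based matcher.
        if prediction.strip().lower() == case.reference.strip().lower():
            correct[case.capability] += 1
    return {cap: correct[cap] / total[cap] for cap in total}

if __name__ == "__main__":
    # Usage: wrap any MFM as f(image_path, question) -> answer string.
    dummy_model = lambda image, question: "pick up the mug"
    cases = [TestCase("embodied_reasoning", "scene_001.png",
                      "What should the robot do next?", "pick up the mug")]
    print(evaluate(dummy_model, cases))
```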
Related papers
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs).
We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs.
We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z) - EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.
We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs).
It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities.
Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z) - Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models [12.841405829775852]
We introduce the modality importance score (MIS) to identify bias in VidQA benchmarks and datasets.
We also propose an innovative method using state-of-the-art MLLMs to estimate the modality importance.
Our results indicate that current models do not effectively integrate information due to modality imbalance in existing datasets.
arXiv Detail & Related papers (2024-08-22T23:32:42Z) - Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z) - MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning [3.651416979200174]
MMCTAgent is a novel critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks.
Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning.
arXiv Detail & Related papers (2024-05-28T16:55:41Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [153.37868034779385]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)