PCA-Bench: Evaluating Multimodal Large Language Models in
Perception-Cognition-Action Chain
- URL: http://arxiv.org/abs/2402.15527v1
- Date: Wed, 21 Feb 2024 07:09:58 GMT
- Title: PCA-Bench: Evaluating Multimodal Large Language Models in
Perception-Cognition-Action Chain
- Authors: Liang Chen and Yichi Zhang and Shuhuai Ren and Haozhe Zhao and Zefan
Cai and Yuchi Wang and Peiyi Wang and Xiangdi Meng and Tianyu Liu and Baobao
Chang
- Abstract summary: We present PCA-Bench, a benchmark for evaluating the integrated capabilities of Multimodal Large Language Models (MLLMs).
Given task instructions and diverse contexts, the model is required to seamlessly integrate Perception, Cognition, and Action in a reasoning chain.
We propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs.
- Score: 37.448177723993346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present PCA-Bench, a multimodal decision-making benchmark for evaluating
the integrated capabilities of Multimodal Large Language Models (MLLMs).
Departing from previous benchmarks focusing on simplistic tasks and individual
model capability, PCA-Bench introduces three complex scenarios: autonomous
driving, domestic robotics, and open-world games. Given task instructions and
diverse contexts, the model is required to seamlessly integrate multiple
capabilities of Perception, Cognition, and Action in a reasoning chain to make
accurate decisions. Moreover, PCA-Bench features error localization
capabilities, scrutinizing model inaccuracies in areas such as perception,
knowledge, or reasoning. This enhances the reliability of deploying MLLMs. To
balance accuracy and efficiency in evaluation, we propose PCA-Eval, an
automatic evaluation protocol, and assess 10 prevalent MLLMs. The results
reveal significant performance disparities between open-source models and
powerful proprietary models like GPT-4 Vision. To address this, we introduce
Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing
instruction tuning examples in multimodal embodied environments. EIE generates
7,510 training examples in PCA-Bench and enhances the performance of
open-source MLLMs, occasionally surpassing GPT-4 Vision (+3% in decision
accuracy), thereby validating the effectiveness of EIE. Our findings suggest
that robust MLLMs like GPT-4 Vision show promise for decision-making in embodied
agents, opening new avenues for MLLM research.
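The abstract describes PCA-Eval only at a high level, so the following is a minimal sketch of how an automatic protocol with error localization over a Perception-Cognition-Action chain could be organized. Everything here (the `PCAExample` schema, the anchor fields, the `pca_eval` function, the keyword matching) is an illustrative assumption rather than the benchmark's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical schema for one PCA-Bench-style instance; the benchmark's real
# data format and scoring rules may differ.
@dataclass
class PCAExample:
    perception_anchors: list[str]   # facts the model must observe in the scene
    cognition_anchors: list[str]    # knowledge or reasoning it must invoke
    correct_action: str             # the ground-truth decision

def pca_eval(example: PCAExample, reasoning: str, action: str) -> dict:
    """Score one response and localize the first failing stage of the chain."""
    text = reasoning.lower()
    perception_ok = all(a.lower() in text for a in example.perception_anchors)
    cognition_ok = all(a.lower() in text for a in example.cognition_anchors)
    action_ok = action.strip().lower() == example.correct_action.lower()

    if not perception_ok:
        error_stage = "perception"
    elif not cognition_ok:
        error_stage = "cognition"
    elif not action_ok:
        error_stage = "action"
    else:
        error_stage = None

    return {"perception": perception_ok, "cognition": cognition_ok,
            "action": action_ok, "error_stage": error_stage}

# Illustrative autonomous-driving instance (made-up values).
example = PCAExample(
    perception_anchors=["red light"],
    cognition_anchors=["stop at a red light"],
    correct_action="stop",
)
print(pca_eval(
    example,
    reasoning="The traffic light ahead is red; vehicles must stop at a red light.",
    action="stop",
))  # {'perception': True, 'cognition': True, 'action': True, 'error_stage': None}
```

A real protocol would presumably rely on semantic matching (e.g., an LLM judge) rather than literal substring checks; the sketch only shows the per-stage decomposition that makes it possible to attribute an error to perception, cognition, or action.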
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, achieving 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- Large Language Model Evaluation Via Multi AI Agents: Preliminary results [3.8066447473175304]
We introduce a novel multi-agent AI model that aims to assess and compare the performance of various Large Language Models (LLMs).
Our model consists of eight distinct AI agents, each responsible for retrieving code from a different advanced language model based on a common description.
We integrate the HumanEval benchmark into our verification agent to assess the performance of the generated code, providing insights into the models' respective capabilities and efficiencies; a minimal sketch of such a test-based verification step follows this list.
arXiv Detail & Related papers (2024-04-01T10:06:04Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering the user experience.
We propose a new evaluation paradigm for MLLMs: evaluating them against per-sample criteria using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing.
As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework.
This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond [38.85644950457275]
We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner.
Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents.
arXiv Detail & Related papers (2023-10-03T14:13:36Z)
- MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale [32.62513495487506]
Machine Learning (ML) and Deep Learning (DL) innovations are being introduced at such a rapid pace that researchers are hard-pressed to analyze and study them.
The complicated procedures for evaluating innovations, along with the lack of standard and efficient ways of specifying and provisioning ML/DL evaluation, are a major "pain point" for the community.
This paper proposes MLModelScope, an open-source, framework- and hardware-agnostic, and customizable design that enables repeatable, fair, and scalable model evaluation and benchmarking.
arXiv Detail & Related papers (2020-02-19T17:13:01Z)
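The multi-agent evaluation paper above describes a verification agent that scores LLM-generated code with the HumanEval benchmark. As a rough, hypothetical illustration of that verification step (not the paper's code and not the official human-eval harness), the sketch below executes a candidate solution against HumanEval-style unit tests in a separate process with a timeout; `verify`, `_run`, and the toy problem are invented for this example.

```python
import multiprocessing

def _run(candidate_code: str, test_code: str, entry_point: str, queue) -> None:
    """Execute a candidate solution against its unit tests in a child process."""
    try:
        namespace: dict = {}
        exec(candidate_code, namespace)           # defines the candidate function
        exec(test_code, namespace)                # defines check(candidate)
        namespace["check"](namespace[entry_point])
        queue.put(True)
    except Exception:
        queue.put(False)

def verify(candidate_code: str, test_code: str, entry_point: str,
           timeout: float = 5.0) -> bool:
    """Return True if the candidate passes all tests within the time limit."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=_run, args=(candidate_code, test_code, entry_point, queue)
    )
    proc.start()
    proc.join(timeout)
    if proc.is_alive():                           # hang or infinite loop: count as failure
        proc.terminate()
        return False
    try:
        return bool(queue.get(timeout=1.0))
    except Exception:
        return False

if __name__ == "__main__":
    # Illustrative HumanEval-style problem (not an actual benchmark item).
    candidate = "def add(a, b):\n    return a + b\n"
    tests = ("def check(candidate):\n"
             "    assert candidate(2, 3) == 5\n"
             "    assert candidate(-1, 1) == 0\n")
    print(verify(candidate, tests, "add"))        # True
```

The official HumanEval evaluation additionally aggregates such per-candidate pass/fail outcomes into pass@k over multiple samples; this sketch covers only the single-candidate check.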
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.