Related papers: EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

URL: http://arxiv.org/abs/2312.06722v3
Date: Tue, 11 Jun 2024 06:53:44 GMT
Title: EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
Authors: Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu,
Abstract summary: We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios. We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
Score: 84.6451394629312
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs. A crucial milestone in the evolution of AGI is the attainment of human-level planning, a fundamental ability for making informed decisions in complex environments, and solving a wide range of real-world problems. Despite the impressive advancements in MLLMs, a question remains: How far are current MLLMs from achieving human-level planning? To shed light on this question, we introduce EgoPlan-Bench, a comprehensive benchmark to evaluate the planning abilities of MLLMs in real-world scenarios from an egocentric perspective, mirroring human perception. EgoPlan-Bench emphasizes the evaluation of planning capabilities of MLLMs, featuring realistic tasks, diverse action plans, and intricate visual observations. Our rigorous evaluation of a wide range of MLLMs reveals that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. To facilitate this advancement, we further present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench. We have made all codes, data, and a maintained benchmark leaderboard available to advance future research.

Related papers

Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences.<n>We formulate this as an $L3$ planning problem, emphasizing long context, long instruction, and long output.<n>We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z)
Zero-shot Robotic Manipulation with Language-guided Instruction and Formal Task Planning [16.89900521727246]
We propose an innovative language-guided symbolic task planning (LM-SymOpt) framework with optimization. It is the first expert-free planning framework since we combine the world knowledge from Large Language Models with formal reasoning. Our experimental results show that LM-SymOpt outperforms existing LLM-based planning approaches.
arXiv Detail & Related papers (2025-01-25T13:33:22Z)
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z)
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems. This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs) One understudied capability inVLMs is visual spatial planning. Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs) We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models [7.216683826556268]
Large language models (LLMs) are increasingly used for applications that require planning capabilities. We introduce SimPlan, a novel hybrid-method, and evaluate its performance in a new challenging setup.
arXiv Detail & Related papers (2024-02-18T07:42:49Z)
Understanding the planning of LLM agents: A survey [98.82513390811148]
This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. Comprehensive analyses are conducted for each direction, and further challenges in the field of research are discussed.
arXiv Detail & Related papers (2024-02-05T04:25:24Z)
On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS) [23.024862968785147]
This paper investigates eight categories based on the unique applications of LLMs in addressing various aspects of planning problems. A critical insight resulting from our review is that the true potential of LLMs unfolds when they are integrated with traditional symbolic planners.
arXiv Detail & Related papers (2024-01-04T19:22:09Z)
Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning. ViLa directly integrates perceptual data into its reasoning and planning process. Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z)
Plansformer: Generating Symbolic Plans using Transformers [24.375997526106246]
Large Language Models (LLMs) have been the subject of active research, significantly advancing the field of Natural Language Processing (NLP) We introduce Plansformer; an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with reduced knowledge-engineering efforts. For one configuration of Plansformer, we achieve 97% valid plans, out of which 95% are optimal for Towers of Hanoi - a puzzle-solving domain.
arXiv Detail & Related papers (2022-12-16T19:06:49Z)
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [34.93870615625937]
PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
arXiv Detail & Related papers (2022-06-21T16:15:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.