Fugu-MT Paper Translation (Abstract): EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

Paper summary: EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

arxiv url: http://arxiv.org/abs/2412.04447v1
Date: Thu, 05 Dec 2024 18:57:23 GMT
Status: Translation complete
Last updated in system: 2024-12-06 20:43:02.207113
Title: EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Title (reference translation): EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Authors: Lu Qiu, Yuying Ge, Yi Chen, Yixiao Ge, Ying Shan, Xihui Liu,
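Abstract summary: EgoPlan-Bench2 is a benchmark for evaluating the planning capabilities of MLLMs. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. The proposed approach improves the performance of GPT-4V by 10.24 on EgoPlan-Bench2.
Reference score (independently computed attention score): 53.26658545922884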
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advent of Multimodal Large Language Models, leveraging the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence. However, achieving AGI necessitates more than just comprehension and reasoning. A crucial capability required is effective planning in diverse scenarios, which involves making reasonable decisions based on complex environments to solve real-world problems. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios, closely aligned with human daily life. EgoPlan-Bench2 is constructed through a semi-automatic process utilizing egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we propose a training-free approach using multimodal Chain-of-Thought (CoT) prompting through investigating the effectiveness of various multimodal prompts in complex planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. We have made data and code available at https://qiulu66.github.io/egoplanbench2/.
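Abstract (reference translation): The emergence of Multimodal Large Language Models, which leverage the power of Large Language Models, has recently demonstrated superior multimodal understanding and reasoning abilities, heralding a new era for artificial general intelligence. However, achieving AGI requires more than comprehension and reasoning alone. A crucial required capability is effective planning in diverse scenarios. Despite its importance, the planning abilities of current MLLMs in varied scenarios remain underexplored. In this paper, we introduce EgoPlan-Bench2, a rigorous and comprehensive benchmark for assessing the planning capabilities of MLLMs. EgoPlan-Bench2 encompasses everyday tasks spanning 4 major domains and 24 detailed scenarios. EgoPlan-Bench2 is constructed semi-automatically from egocentric videos, complemented by manual verification. Grounded in a first-person perspective, it mirrors the way humans approach problem-solving in everyday life. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. To further improve the planning proficiency of current MLLMs, we investigate the effectiveness of various multimodal prompts in complex planning and propose a training-free approach using multimodal Chain-of-Thought (CoT) prompting. Our approach improves the performance of GPT-4V by 10.24 on EgoPlan-Bench2. Our work not only sheds light on the current limitations of MLLMs in planning, but also provides insights for future enhancements in this critical area. The data and code are available at https://qiulu66.github.io/egoplanbench2/.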

Related papers

Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
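Travel planning is a complex task that requires integrating diverse real-world information and user preferences. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. We introduce Multiple Aspects of Planning (MAoP), which enables LLMs to conduct wide-horizon thinking to solve complex planning problems.
Reference translation of paper (metadata) (2025-06-14T09:37:59Z)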
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
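Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they struggle with problems that require multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
Reference translation of paper (metadata) (2025-02-17T18:49:25Z)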
MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [52.739500459903724]
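Large language models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
Reference translation of paper (metadata) (2024-11-26T17:53:44Z)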
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
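Multimodal Large Language Models (MLLMs) address the complexities of real-world applications that exceed the capabilities of single-modality systems. This paper systematically organizes the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
Reference translation of paper (metadata) (2024-08-02T15:14:53Z)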
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
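We introduce EgoPlan-Bench, a benchmark for evaluating the planning abilities of MLLMs in real-world scenarios. EgoPlan-Bench highlights the substantial room for improvement of MLLMs in achieving human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively improves model performance on EgoPlan-Bench.
Reference translation of paper (metadata) (2023-12-11T03:35:58Z)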
Improving Planning with Large Language Models: A Modular Agentic Architecture [7.63815864256878]
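Large language models (LLMs) often struggle with tasks that require multi-step reasoning or goal-directed planning. We propose MAP, an agentic architecture in which planning is accomplished through the iterative interaction of specialized modules. We find that MAP yields significant improvements over standard LLM methods.
Reference translation of paper (metadata) (2023-09-30T00:10:14Z)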
AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
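Large language models (LLMs) have recently demonstrated the potential to act as autonomous agents for sequential decision-making tasks. We propose AdaPlanner, a closed-loop approach that allows an LLM agent to adaptively refine its self-generated plan in response to environmental feedback. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
Reference translation of paper (metadata) (2023-05-26T05:52:27Z)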
Understanding the Capabilities of Large Language Models for Automated Planning [24.37599752610625]
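This work seeks to shed light on the capabilities of LLMs in solving complex planning problems. It offers insights into the most effective approaches for using LLMs in this context.
Reference translation of paper (metadata) (2023-05-25T15:21:09Z)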
Plansformer: Generating Symbolic Plans using Transformers [24.375997526106246]
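Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) and have become an active research subject. Plansformer is fine-tuned on planning problems and can generate plans with favorable behavior in terms of correctness and length, reducing knowledge-engineering effort. One configuration of Plansformer achieves 97% valid plans.
Reference translation of paper (metadata) (2022-12-16T19:06:49Z)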
PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [34.93870615625937]
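PlanBench is a benchmark suite based on the kinds of domains used in the automated planning community. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities.
Reference translation of paper (metadata) (2022-06-21T16:15:27Z)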

The related papers list is generated automatically from the titles and abstracts of papers on this site.

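This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).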