Fugu-MT Paper Translation (Summary): MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation

Paper summary: MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation

arxiv url: http://arxiv.org/abs/2512.19453v1
Date: Mon, 22 Dec 2025 14:58:52 GMT
Status: translation complete
Last updated in system: 2026-03-23 08:17:40.470325
Title: MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation
Title (reference translation): MaP-AVR: A Meta-Action Planner for Agents Leveraging Vision Language Models and Retrieval-Augmented Generation
Authors: Zhenglong Guo, Yiming Zhao, Feng Jiang, Heng Jin, Zongbao Feng, Jianbin Zhou, Siyuan Xu
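Abstract summary: Robotic AI systems designed to manage complex daily tasks rely on a task planner to understand and decompose high-level tasks. This paper argues that defining the planned skill set is equally crucial. To handle the complexity of daily environments, the skill set should possess a high degree of generalization ability.
Reference score (independently computed attention score): 18.84633713315585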
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied robotic AI systems designed to manage complex daily tasks rely on a task planner to understand and decompose high-level tasks. While most research focuses on enhancing the task-understanding abilities of LLMs/VLMs through fine-tuning or chain-of-thought prompting, this paper argues that defining the planned skill set is equally crucial. To handle the complexity of daily environments, the skill set should possess a high degree of generalization ability. Empirically, more abstract expressions tend to be more generalizable. Therefore, we propose to abstract the planned result as a set of meta-actions. Each meta-action comprises three components: {move/rotate, end-effector status change, relationship with the environment}. This abstraction replaces human-centric concepts, such as grasping or pushing, with the robot's intrinsic functionalities. As a result, the planned outcomes align seamlessly with the complete range of actions that the robot is capable of performing. Furthermore, to ensure that the LLM/VLM accurately produces the desired meta-action format, we employ the Retrieval-Augmented Generation (RAG) technique, which leverages a database of human-annotated planning demonstrations to facilitate in-context learning. As the system successfully completes more tasks, the database will self-augment to continue supporting diversity. The meta-action set and its integration with RAG are two novel contributions of our planner, denoted as MaP-AVR, the meta-action planner for agents composed of VLM and RAG. To validate its efficacy, we design experiments using GPT-4o as the pre-trained LLM/VLM model and OmniGibson as our robotic platform. Our approach demonstrates promising performance compared to the current state-of-the-art method. Project page: https://map-avr.github.io/.
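Abstract (reference translation): Robotic AI systems designed to manage complex daily tasks rely on a task planner to understand and decompose high-level tasks. Much research focuses on improving the task-understanding abilities of LLMs/VLMs through fine-tuning or chain-of-thought prompting, but this paper argues that defining the planned skill set is equally important. To handle the complexity of daily environments, the skill set should have a high degree of generalization ability. Empirically, more abstract expressions tend to generalize better. This work therefore proposes abstracting the planned result as a set of meta-actions. Each meta-action consists of three components: {move/rotate, end-effector status change, relationship with the environment}. This abstraction replaces human-centric concepts, such as grasping or pushing, with the robot's intrinsic functionalities. As a result, the planned outcomes align seamlessly with the complete range of actions the robot can perform. Furthermore, to ensure that the LLM/VLM accurately produces the desired meta-action format, the authors use the Retrieval-Augmented Generation (RAG) technique, which draws on a database of human-annotated planning demonstrations to facilitate in-context learning. As the system completes more tasks, the database self-augments to continue supporting diversity. The meta-action set and its integration with RAG are the two novel contributions of the planner, called MaP-AVR, a meta-action planner for agents composed of a VLM and RAG. Experiments were designed with GPT-4o as the pre-trained LLM/VLM and OmniGibson as the robotic platform. The proposed approach shows promising performance compared with the current state-of-the-art method. Project page: https://map-avr.github.io/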
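The abstract describes two mechanisms: a meta-action made of three components, and retrieval of annotated planning demonstrations that are placed in the prompt for in-context learning, with the database growing as tasks succeed. The following is a minimal sketch of that idea, not the authors' implementation. All names (`MetaAction`, `Demonstration`, `DemoDatabase`, `build_prompt`) are hypothetical, the field values are made-up examples, and the token-overlap (Jaccard) similarity is a simple stand-in assumption, since the abstract does not say how demonstrations are retrieved.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetaAction:
    """One planned step, split into the three components named in the abstract."""
    motion: str        # move/rotate component (hypothetical free-text encoding)
    end_effector: str  # end-effector status change, e.g. "open", "close"
    relation: str      # relationship with the environment


@dataclass
class Demonstration:
    """An annotated planning demonstration: a task and its meta-action plan."""
    task: str
    plan: list


def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for the real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class DemoDatabase:
    demos: list = field(default_factory=list)

    def retrieve(self, task, k=2):
        """Return the k stored demonstrations most similar to the task text."""
        ranked = sorted(self.demos, key=lambda d: jaccard(task, d.task), reverse=True)
        return ranked[:k]

    def add_if_successful(self, task, plan, success):
        """Self-augmentation: store a plan only if its task completed."""
        if success:
            self.demos.append(Demonstration(task, list(plan)))


def build_prompt(task, examples):
    """Format retrieved demonstrations as in-context examples for an LLM/VLM."""
    lines = []
    for demo in examples:
        lines.append(f"Task: {demo.task}")
        for i, a in enumerate(demo.plan, 1):
            lines.append(
                f"  {i}. motion={a.motion}; end_effector={a.end_effector}; relation={a.relation}"
            )
    lines.append(f"Task: {task}")
    lines.append("Plan:")
    return "\n".join(lines)
```

In use, a caller would retrieve examples for a new task, pass `build_prompt(task, examples)` to the model, and call `add_if_successful` once the resulting plan has been executed, so that only plans from completed tasks are stored.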


Related papers