Fugu-MT Paper Translation (Abstract): VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Paper overview: VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

arxiv url: http://arxiv.org/abs/2507.00079v1
Date: Sun, 29 Jun 2025 14:16:11 GMT
Status: Translation complete
Last updated in system: 2025-07-03 14:22:58.331215
Title: VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
Authors: Ethan Smyth, Alessandro Suglia
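Abstract summary: This paper proposes VoyagerVision, which can create structures within Minecraft using screenshots as a form of visual feedback. VoyagerVision succeeded in half of all attempts in flat worlds, with most failures arising in more complex structures.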
Reference score (independently computed attention metric): 50.97354139604596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-endedness is an active field of research in the pursuit of capable Artificial General Intelligence (AGI), allowing models to pursue tasks of their own choosing. Simultaneously, recent advancements in Large Language Models (LLMs) such as GPT-4o [9] have made such models capable of interpreting image inputs. Implementations such as OMNI-EPIC [4] have made use of such features, providing an LLM with pixel data of an agent's POV to parse the environment and allow it to solve tasks. This paper proposes that providing these visual inputs to a model gives it greater ability to interpret spatial environments and, as such, can increase the number of tasks it can successfully perform, extending its open-ended potential. To this end, this paper proposes VoyagerVision, a multi-modal model capable of creating structures within Minecraft using screenshots as a form of visual feedback, building on the foundation of Voyager. VoyagerVision created an average of 2.75 unique structures within fifty iterations of the system; because Voyager was incapable of this, it is an extension in an entirely new direction. Additionally, in a set of building unit tests, VoyagerVision was successful in half of all attempts in flat worlds, with most failures arising in more complex structures. The project website is available at https://esmyth-dev.github.io/VoyagerVision.github.io/

Related papers