Fugu-MT Paper Translation (Abstract): Guava: An Effective and Universal Harness for Embodied Manipulation

Paper overview: Guava: An Effective and Universal Harness for Embodied Manipulation

arxiv url: http://arxiv.org/abs/2606.18363v1
Date: Tue, 16 Jun 2026 18:09:26 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:50.835885
Title: Guava: An Effective and Universal Harness for Embodied Manipulation
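Title (reference translation): Guava: An Effective and Universal Harness for Embodied Manipulation
Authors: Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao
Abstract summary: The authors develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model. The results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation.
Reference score (independently computed attention metric): 74.34187069605844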
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
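Abstract (reference translation): Language models trained on large-scale vision-language data have shown strong potential for embodied agents. Harnessing such models through embodied tool use combines high-level reasoning with external modules for perception, planning, and control, and offers a promising alternative to end-to-end vision-language-action systems. However, it remains unclear what makes a harness effective for embodied manipulation, and to what extent such a harness can unlock embodied capabilities across a wide range of reasoning models. This work introduces Guava, a harness framework for embodied tool use, developed by systematically exploring the design space of agent workflows, action spaces, and observation spaces. The study identifies three key ingredients of effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles hold even for small models, the authors develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experiments in both simulation and real-world environments show performance comparable to frontier proprietary models, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks. The results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.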
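Illustrative sketch (not from the paper): the abstract names three ingredients, namely an iterative perception-reasoning-action loop, semantic action abstractions, and multimodal observations. The toy Python below shows one way such a loop could be wired together. Every name here (`ToyTabletop`, `Observation`, `scripted_policy`, `run_episode`) is hypothetical, the environment is a trivial stand-in for a simulator, and the scripted policy stands in for a reasoning model. It should not be read as Guava's actual interface or implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, argument); hypothetical action format


@dataclass
class Observation:
    """Hypothetical multimodal observation: image placeholder plus text."""
    image: List[List[int]]  # stand-in for camera pixels
    text: str               # stand-in for a textual scene description


@dataclass
class ToyTabletop:
    """Toy stand-in for a simulator; not the environment used in the paper."""
    locations: Dict[str, str] = field(default_factory=lambda: {"cube": "table"})
    holding: str = ""

    def observe(self) -> Observation:
        desc = ", ".join(f"{o} at {p}" for o, p in sorted(self.locations.items()))
        held = self.holding or "nothing"
        return Observation(image=[[0]], text=f"{desc}; holding {held}")

    # Semantic actions: named skills instead of low-level joint commands.
    def pick(self, obj: str) -> str:
        if self.holding or obj not in self.locations:
            return f"pick({obj}) failed"
        self.holding = obj
        self.locations[obj] = "gripper"
        return f"picked {obj}"

    def place(self, target: str) -> str:
        if not self.holding:
            return "place failed: gripper empty"
        self.locations[self.holding] = target
        message = f"placed {self.holding} at {target}"
        self.holding = ""
        return message


def scripted_policy(goal: Tuple[str, str], obs: Observation,
                    history: List[str]) -> Action:
    """Stand-in for a reasoning model: picks the next semantic action."""
    obj, target = goal
    if f"{obj} at {target}" in obs.text:
        return ("done", "")
    if f"holding {obj}" in obs.text:
        return ("place", target)
    return ("pick", obj)


def run_episode(env: ToyTabletop,
                policy: Callable[[Tuple[str, str], Observation, List[str]], Action],
                goal: Tuple[str, str], max_steps: int = 10) -> List[str]:
    """Iterate perceive -> reason -> act until the policy reports completion."""
    tools = {"pick": env.pick, "place": env.place}
    history: List[str] = []
    for _ in range(max_steps):
        obs = env.observe()                     # perception
        name, arg = policy(goal, obs, history)  # reasoning
        if name == "done":
            history.append("done")
            break
        history.append(tools[name](arg))        # action, with textual feedback
    return history


if __name__ == "__main__":
    print(run_episode(ToyTabletop(), scripted_policy, ("cube", "bin")))
```

In this sketch the policy sees a fresh observation and the feedback history at every step, so a failed action can be noticed and retried on the next iteration. That is the main reason to prefer a loop over a single up-front plan.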


Related papers