Fugu-MT Paper Translation (Abstract): HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

Paper summary: HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

arxiv url: http://arxiv.org/abs/2602.21157v2
Date: Fri, 27 Feb 2026 18:18:30 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:41.677675
Title: HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
Authors: Quanxin Shou, Fangqi Zhu, Shawn Chen, Puxin Yan, Zhengyang Yan, Yikun Miao, Xiaoyi Pang, Zicong Hong, Ruikai Shi, Hao Huang, Jie Zhang, Song Guo
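Abstract summary: Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation but often struggle in long-horizon or out-of-distribution scenarios. This paper proposes HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning. HALO is instantiated with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts.
Reference score (site-computed attention score): 23.266655371621965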
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to reason, but still fail to offer a unified human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO learning at scale, we introduce an automated pipeline to synthesize EM-CoT training data along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing baseline policy pi_0 by 34.1% on RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design help improve task success rate; and (3) HALO exhibits strong generalization capabilities under aggressive unseen environmental randomization with our proposed EM-CoT reasoning.
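The sequential EM-CoT process named in the abstract (textual task reasoning, then visual subgoal prediction, then action prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `EMCoTStep`, `em_cot_step`, and the three toy stub functions are hypothetical, lists of floats stand in for images and action vectors, and the assumption that each stage is conditioned on the outputs of the earlier stages is an interpretation of the abstract rather than a confirmed detail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EMCoTStep:
    reasoning: str        # textual task reasoning
    subgoal: List[float]  # stand-in for a predicted subgoal image
    action: List[float]   # stand-in for a robot action vector


def em_cot_step(
    observation: List[float],
    instruction: str,
    reason: Callable[[List[float], str], str],
    foresee: Callable[[List[float], str], List[float]],
    act: Callable[[List[float], str, List[float]], List[float]],
) -> EMCoTStep:
    """Run the three stages in order (hypothetical interface)."""
    # 1) Textual task reasoning from the observation and instruction.
    reasoning = reason(observation, instruction)
    # 2) Visual subgoal prediction, assumed to be conditioned on the reasoning.
    subgoal = foresee(observation, reasoning)
    # 3) Action prediction, assumed to be conditioned on both earlier outputs.
    action = act(observation, reasoning, subgoal)
    return EMCoTStep(reasoning, subgoal, action)


# Toy stand-ins for the three experts; real experts would be learned models.
def toy_reason(obs: List[float], instruction: str) -> str:
    return f"plan: {instruction}"


def toy_foresee(obs: List[float], reasoning: str) -> List[float]:
    return [x + 1.0 for x in obs]


def toy_act(obs: List[float], reasoning: str, subgoal: List[float]) -> List[float]:
    # Move from the current observation toward the predicted subgoal.
    return [g - o for g, o in zip(subgoal, obs)]


step = em_cot_step([0.0, 2.0], "pick up the cup", toy_reason, toy_foresee, toy_act)
print(step)
```

Passing the three stages as callables keeps the ordering explicit while leaving open how each expert is realized; in the paper they are described as specialized experts inside one Mixture-of-Transformers model rather than separate functions.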

Related papers