Fugu-MT Paper Translation (Summary): Being-H0.7: A Latent World-Action Model from Egocentric Videos

Paper summary: Being-H0.7: A Latent World-Action Model from Egocentric Videos

arxiv url: http://arxiv.org/abs/2605.00078v1
Date: Thu, 30 Apr 2026 14:16:15 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.678756
Title: Being-H0.7: A Latent World-Action Model from Egocentric Videos
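Title (reference translation): Being-H0.7: A Latent World-Action Model from Egocentric Videos
Authors: Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, Zongqing Lu
Abstract summary: We propose Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface.
Reference score (independently computed attention score): 32.77431338471086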
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.
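Abstract (reference translation): VLAs (Visual-Language-Action models) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision tends to encourage shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, but pixel-space prediction is a costly and indirect substrate for control, because it may model visual details irrelevant to action generation and adds substantial training or inference overhead. We propose Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches in the latent reasoning space leads the prior branch to infer future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments on six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.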
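The dual-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention, the tensor sizes, the mean-squared-error losses, and every name (latent_queries, W_future, W_action, align_weight) are assumptions chosen only to make the data flow concrete. The one structural point taken from the abstract is that the posterior branch swaps the learnable queries for future-observation embeddings during training, and that only the prior branch runs at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D = 16       # embedding width
N_CTX = 8    # tokens encoding the current observation and instruction
N_Q = 4      # number of latent query slots
N_ACT = 7    # action dimension

# Hypothetical parameters (a real model would learn these).
latent_queries = rng.normal(size=(N_Q, D))               # used by the prior branch
W_future = rng.normal(size=(D, D)) / np.sqrt(D)          # future-observation encoder
W_action = rng.normal(size=(N_Q * D, N_ACT)) / np.sqrt(N_Q * D)


def attend(queries, context):
    """Single-head dot-product attention: each query pools the context tokens."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context


def prior_branch(context):
    """Deployable branch: latent states from the current context only."""
    return attend(latent_queries, context)


def posterior_branch(context, future_obs):
    """Training-only branch: future embeddings take the place of the queries."""
    future_emb = future_obs @ W_future
    return attend(future_emb, context)


def action_head(latents):
    """Map the latent slots to an action vector."""
    return latents.reshape(-1) @ W_action


def training_loss(context, future_obs, target_action, align_weight=1.0):
    """Action loss on both branches plus an assumed MSE alignment term."""
    z_prior = prior_branch(context)
    z_post = posterior_branch(context, future_obs)
    act_loss = (np.mean((action_head(z_prior) - target_action) ** 2)
                + np.mean((action_head(z_post) - target_action) ** 2))
    align_loss = np.mean((z_prior - z_post) ** 2)
    return act_loss + align_weight * align_loss


def act(context):
    """Inference: the posterior branch is dropped and no frames are generated."""
    return action_head(prior_branch(context))
```

In this sketch, act never touches W_future or any future observation, which mirrors the claim that deployment needs neither the posterior branch nor a visual rollout. How the paper actually defines the alignment objective and the action decoder is not stated in the abstract.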

Related papers