Fugu-MT Paper Translation (Abstract): FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

Paper overview: FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

arxiv url: http://arxiv.org/abs/2603.10712v1
Date: Wed, 11 Mar 2026 12:39:55 GMT
Status: Translation complete
System update date: 2026-03-21 18:33:56.683142
Title: FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model
Authors: Xiaoxu Xu, Hao Li, Jinhui Ye, Yilun Chen, Jia Zeng, Xinyi Chen, Linning Xu, Dahua Lin, Weixin Li, Jiangmiao Pang
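Abstract summary: We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling. FutureVLA is designed to extract joint visuomotor embeddings by first decoupling visual and motor information. In the post-training stage, we employ a latent embeddings alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors.
Reference score (independently computed attention metric): 73.03346643967309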
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predictive foresight is important to intelligent embodied agents. Since the motor execution of a robot is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods relying on sparse frame pairs disrupt temporal continuity. By heavily relying on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling. To this end, we propose FutureVLA, featuring a novel Joint Visuomotor Predictive Architecture. FutureVLA is designed to extract joint visuomotor embeddings by first decoupling visual and motor information, and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism to structurally separate visual state preservation from temporal action modeling. It allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embeddings alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves VLA frameworks.
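The abstract names a latent embeddings alignment strategy for the post-training stage but gives no formula for it. The following is a minimal sketch of one common way such an alignment can be set up, not the authors' implementation: a training-only projection maps the downstream model's hidden features into the space of frozen pretrained embeddings, and a cosine-distance term is added to the action loss. The function names, the projection matrix `W`, the weight `lam`, and all numeric values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(h, W):
    """Linear projection of h (length d_in) by W (d_in rows, d_out columns)."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]

def alignment_loss(student_hidden, teacher_embed, W):
    """Mean cosine distance between projected student features and teacher embeddings.

    Assumption: the teacher embeddings come from a frozen pretrained model, and W is
    used only during training, so the student's inference path is unchanged.
    """
    losses = [1.0 - cosine(project(h, W), t)
              for h, t in zip(student_hidden, teacher_embed)]
    return sum(losses) / len(losses)

# Toy values, purely illustrative.
student_hidden = [[0.5, -1.0, 2.0], [1.5, 0.2, -0.3]]   # hypothetical VLA hidden features
teacher_embed = [[1.0, 0.0], [0.0, 1.0]]                # hypothetical frozen embeddings
W = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]               # hypothetical training-only projection

align = alignment_loss(student_hidden, teacher_embed, W)
action_loss = 0.8          # placeholder for the downstream model's usual action loss
lam = 0.5                  # hypothetical weighting of the alignment term
total_loss = action_loss + lam * align
```

Because the projection and the alignment term exist only in the training objective, dropping them at inference leaves the downstream architecture untouched, which is the property the abstract emphasizes.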
