Fugu-MT Paper Translation (Summary): FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Paper summary: FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

arxiv url: http://arxiv.org/abs/2602.17259v1
Date: Thu, 19 Feb 2026 11:00:46 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:41.567802
Title: FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
Title (reference translation): FRAPPE: Introducing World Modeling into Generalist Policies via Multiple Representation Alignment
Authors: Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, Donglin Wang
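Abstract summary: Enabling VLA models to predict environmental dynamics is known as world modeling. Current approaches face two problems: (1) training objectives over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, the authors introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE).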
Reference score (independently computed attention score): 45.21358158991569
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.
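Abstract (reference translation): Enabling VLA models to predict environmental dynamics, known as world modeling, is recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main problems: (1) the training objective forces the model to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, the authors introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). The method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, the computational workload is expanded in parallel and the representation is aligned simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable, data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks show that FRAPPE outperforms state-of-the-art approaches and generalizes strongly in long-horizon and unseen scenarios.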
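
Illustrative note: the abstract describes aligning predicted future representations with several visual foundation models at once, but it does not state the loss that is used. The sketch below is only one plausible reading, not the authors' implementation. It assumes a cosine-distance alignment term averaged over teachers; the function names, the teacher names (`teacher_a`, `teacher_b`), and the feature shapes are all hypothetical.

```python
import numpy as np

def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over a batch of latents, shape (batch, dim).

    Assumption: cosine distance is used as the alignment objective;
    the abstract does not specify the actual loss.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def multi_teacher_alignment_loss(preds, targets):
    """Average the alignment loss over several (hypothetical) teacher encoders.

    preds / targets: dicts mapping a teacher name to an array of shape
    (batch, dim), where dim may differ per teacher. `targets` stands in for
    features of the future observation from a frozen visual foundation
    model; `preds` stands in for the policy's predicted latents.
    """
    per_teacher = {
        name: cosine_alignment_loss(preds[name], targets[name])
        for name in targets
    }
    total = sum(per_teacher.values()) / len(per_teacher)
    return total, per_teacher

# Toy data standing in for future-frame features from two teachers.
rng = np.random.default_rng(0)
targets = {
    "teacher_a": rng.normal(size=(4, 16)),
    "teacher_b": rng.normal(size=(4, 32)),
}

# Predictions that point in the same direction as the targets
# (scale is ignored by cosine similarity) give a loss near zero.
aligned = {name: 3.0 * feats for name, feats in targets.items()}
total_aligned, _ = multi_teacher_alignment_loss(aligned, targets)

# Predictions pointing in the opposite direction give a loss near two.
opposite = {name: -feats for name, feats in targets.items()}
total_opposite, _ = multi_teacher_alignment_loss(opposite, targets)
```

Because each teacher is scored in its own feature space, teachers with different embedding sizes can be combined without a shared projection; how FRAPPE actually handles this is not stated in the abstract.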


Related papers