Fugu-MT Paper Translation (Summary): One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Paper summary: One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

arxiv url: http://arxiv.org/abs/2605.07931v3
Date: Wed, 13 May 2026 19:21:29 GMT
Status: Translation complete
System update date: 2026-05-15 15:19:49.84026
Title: One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Title (reference translation): One Token per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Authors: Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu
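Abstract summary: Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons. OneWM-VLA compresses each view into a single semantic token per frame through Adaptive Attention Pooling. Under the paper's setup, per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance.
Reference score (independently computed attention score): 20.112404170033944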
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).
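Abstract (reference translation): Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, but how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream to the world module at high visual bandwidth and treat the rollout as a by-product of action prediction. This paper proposes OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling. Experiments show that per-frame visual bandwidth can be reduced to a single token. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth (vs. 20.0% for $π_0$).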
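
Illustrative sketch (not from the paper): the abstract says each view is compressed into a single token per frame through Adaptive Attention Pooling, but it does not describe the module's internals. The NumPy example below shows plain single-query attention pooling as one plausible reading of that step. The function name `attention_pool`, the projection matrices `w_k` and `w_v`, the random `query`, and all shapes are assumptions made for illustration. Whatever makes the paper's pooling "adaptive" is not modeled here.

```python
import numpy as np

def attention_pool(patch_tokens, query, w_k, w_v):
    """Pool (num_patches, dim) patch tokens into one (dim,) token.

    Hypothetical single-query attention pooling; not the paper's code.
    """
    keys = patch_tokens @ w_k                        # (num_patches, dim)
    values = patch_tokens @ w_v                      # (num_patches, dim)
    scores = keys @ query / np.sqrt(query.shape[0])  # (num_patches,)
    scores -= scores.max()                           # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum()                         # attention over patches
    return weights @ values, weights                 # pooled token, weights

# Toy data: 4 frames of one camera view, 256 patch tokens each, width 32.
rng = np.random.default_rng(0)
num_frames, num_patches, dim = 4, 256, 32
frames = rng.normal(size=(num_frames, num_patches, dim))
query = rng.normal(size=dim)                         # would be learned in practice
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

# One token per frame: the latent stream has shape (num_frames, dim).
latent_stream = np.stack(
    [attention_pool(f, query, w_k, w_v)[0] for f in frames]
)
print(latent_stream.shape)  # (4, 32)
```

In this toy setting the per-frame input to a downstream world module shrinks from 256 tokens to 1, which is the kind of bandwidth reduction the abstract refers to.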

Related papers