Fugu-MT Paper Translation (Abstract): Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

Paper overview: Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

arxiv url: http://arxiv.org/abs/2605.24203v1
Date: Fri, 22 May 2026 20:43:47 GMT
Status: Translation complete
Last updated in system: 2026-06-02 18:30:46.65661
Title: Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance
Title (reference translation): Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance
Authors: Runze Wang, Yuqian Fu, Yu Li, Tao Lin, Tianwen Qian, Mohamed Elhoseiny, Bo Zhao, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
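Abstract summary: We argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. We propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models.
Reference score (independently computed attention score): 108.46436073194546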
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models have shown strong potential for generalist robot manipulation, yet they remain limited by insufficient spatial reasoning, particularly in determining where to interact in complex visual scenes. While recent efforts introduce various forms of visual planning to address this issue, existing approaches rely on global geometric cues, symbolic intermediate representations, or externally generated visual signals, which are often weakly coupled with downstream action prediction. In this work, we revisit visual planning in VLA systems and argue that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Concretely, we introduce learnable <AFF> tokens to query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design enables affordance to be both generated and utilized within the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that allows the affordance pathway to be jointly optimized with action prediction, improving its effectiveness for downstream control. We evaluate our method on multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, achieving consistent state-of-the-art performance, along with strong real-world results. These findings demonstrate that internalizing affordance as action-aligned visual planning provides a powerful paradigm for improving VLA systems.
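Abstract (reference translation): Vision-language-action (VLA) models show strong potential for generalist robot manipulation, but insufficient spatial reasoning limits them, particularly when deciding where to interact in complex visual scenes. Recent work has proposed various forms of visual planning to address this problem, but existing approaches depend on global geometric cues, symbolic intermediate representations, or externally generated visual signals that are often only weakly coupled to downstream action prediction. This work revisits visual planning in VLA systems and argues that effective planning should be local, visually grounded, internally generated, and directly aligned with action. Based on this insight, we propose Afford-VLA, a unified framework that internalizes task-conditioned affordance as an explicit visual planning interface within VLA models. Specifically, we introduce learnable <AFF> tokens that query task-relevant interaction regions, decode affordance masks from multimodal features, and convert them into compact embeddings that directly condition action generation. This design allows affordance to be both generated and used inside the VLA, forming a tightly coupled perception-action pathway. To further support this integration, we adopt a training strategy that optimizes the affordance pathway jointly with action prediction, improving its effectiveness for downstream control. On multiple simulation benchmarks, including LIBERO, LIBERO-Plus, and SimplerEnv, the method achieves consistent state-of-the-art performance along with strong real-world results. These results suggest that internalizing affordance as action-aligned visual planning is an effective paradigm for improving VLA systems.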
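The abstract describes a three-step pathway: an <AFF> token queries interaction regions, an affordance mask is decoded from the features, and the mask is converted into a compact embedding that conditions action generation. The toy sketch below illustrates only that data flow. Every concrete choice in it is an assumption made for illustration and is not taken from the paper: dot-product scoring with a sigmoid for the mask, mask-weighted average pooling for the embedding, a linear action head, and all function names and dimensions.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decode_affordance_mask(aff_token, patch_features):
    """Score each patch against the <AFF> query (assumed: scaled dot product + sigmoid).

    Returns one soft mask value in (0, 1) per patch.
    """
    scale = math.sqrt(len(aff_token))
    return [1.0 / (1.0 + math.exp(-dot(aff_token, p) / scale)) for p in patch_features]


def mask_to_embedding(mask, patch_features):
    """Compress the mask into one vector (assumed: mask-weighted average of patch features)."""
    total = sum(mask)
    dim = len(patch_features[0])
    return [sum(m * p[d] for m, p in zip(mask, patch_features)) / total for d in range(dim)]


def predict_action(aff_embedding, state, weights):
    """Hypothetical linear action head conditioned on [state ; affordance embedding]."""
    x = state + aff_embedding
    return [dot(row, x) for row in weights]


# Random stand-ins for learned parameters and encoder outputs (illustrative sizes).
rng = random.Random(0)
dim, num_patches, state_dim, action_dim = 8, 16, 4, 7
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
aff_token = [rng.gauss(0, 1) for _ in range(dim)]
state = [rng.gauss(0, 1) for _ in range(state_dim)]
weights = [[rng.gauss(0, 0.1) for _ in range(state_dim + dim)] for _ in range(action_dim)]

mask = decode_affordance_mask(aff_token, patches)   # where to interact
embedding = mask_to_embedding(mask, patches)        # compact planning signal
action = predict_action(embedding, state, weights)  # action conditioned on affordance
print(len(mask), len(embedding), len(action))
```

Because every step here is differentiable in principle, the mask and the action could be trained together, which is the kind of joint optimization the abstract mentions; the actual architecture and losses are given only in the paper itself.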

Related papers