Fugu-MT Paper Translation (Summary): $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation

Paper summary: $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation

arxiv url: http://arxiv.org/abs/2603.08361v1
Date: Mon, 09 Mar 2026 13:26:05 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:16.088018
Title: $Δ$VLA: Prior-Guided Vision-Language-Action Models via World Knowledge Variation
Authors: Yijie Zhu, Jie He, Rui Shao, Kaishen Yuan, Tao Tan, Xiaochen Yuan, Zitong Yu
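Abstract summary: $Δ$VLA is a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation. It achieves state-of-the-art performance while improving efficiency.
Reference score (independently computed attention measure): 46.27589938801435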
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent vision-language-action (VLA) models have significantly advanced robotic manipulation by unifying perception, reasoning, and control. To achieve such integration, recent studies adopt a predictive paradigm that models future visual states or world knowledge to guide action generation. However, these models emphasize forecasting outcomes rather than reasoning about the underlying process of change, which is essential for determining how to act. To address this, we propose $Δ$VLA, a prior-guided framework that models world-knowledge variations relative to an explicit current-world knowledge prior for action generation, rather than regressing absolute future world states. Specifically, 1) to construct the current world knowledge prior, we propose the Prior-Guided World Knowledge Extractor (PWKE). It extracts manipulable regions, spatial relations, and semantic cues from the visual input, guided by auxiliary heads and prior pseudo labels, thus reducing redundancy. 2) Building upon this, to represent how world knowledge evolves under actions, we introduce the Latent World Variation Quantization (LWVQ). It learns a discrete latent space via a VQ-VAE objective to encode world knowledge variations, shifting prediction from full modalities to compact latents. 3) Moreover, to mitigate interference during variation modeling, we design the Conditional Variation Attention (CV-Atten), which promotes disentangled learning and preserves the independence of knowledge representations. Extensive experiments on both simulated benchmarks and real-world robotic tasks demonstrate that $Δ$VLA achieves state-of-the-art performance while improving efficiency. Code and real-world execution videos are available at https://github.com/JiuTian-VL/DeltaVLA.
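Illustrative sketch (not from the paper): the abstract describes encoding world-knowledge variations in a discrete latent space learned with a VQ-VAE objective. The core quantization step in VQ-VAE-style models is a nearest-neighbor lookup into a codebook, and the toy example below shows only that step, applied to the difference between two feature vectors. The function names, the two-dimensional features, and the hand-written codebook are all assumptions made for illustration. It omits the encoder, decoder, and training losses, and it is not the authors' LWVQ implementation.

```python
from typing import List, Sequence, Tuple


def variation(current: Sequence[float], future: Sequence[float]) -> List[float]:
    """Element-wise change between two (hypothetical) world-knowledge feature vectors."""
    if len(current) != len(future):
        raise ValueError("feature vectors must have the same length")
    return [f - c for c, f in zip(current, future)]


def quantize(delta: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and entry of the codebook vector nearest to delta (squared L2)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((d - c) ** 2 for d, c in zip(delta, code))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, list(codebook[index])


# Toy, hand-written codebook (assumption); a real VQ-VAE would learn these entries.
CODEBOOK = [
    [0.0, 0.0],  # no change
    [1.0, 0.0],  # change along the first feature
    [0.0, 1.0],  # change along the second feature
]

# Hypothetical current and future feature vectors.
current_features = [0.2, 0.5]
future_features = [1.1, 0.6]

delta = variation(current_features, future_features)
code_index, code_vector = quantize(delta, CODEBOOK)
print(code_index, code_vector)
```

In this toy setting the model would predict a single codebook index for the change rather than the full future feature vector, which is the sense in which prediction shifts to a compact latent.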

