Fugu-MT Paper Translation (Abstract): VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

Paper abstract: VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

arxiv url: http://arxiv.org/abs/2606.10568v1
Date: Tue, 09 Jun 2026 08:31:59 GMT
Status: Translation complete
Last updated in system: 2026-06-11 16:42:38.005574
Title: VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models
Authors: Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu
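Abstract summary: Vision-language-action (VLA) models show strong promise for robotic manipulation, but their reliability at test time is limited by one-shot action prediction. The paper presents VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems.
Reference score (independently computed attention metric): 19.75611749501909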
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.
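The propose-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the VeriSpace implementation: `select_action` and `toy_verifier` are hypothetical names, actions are assumed to be plain 3D displacement tuples, and the scoring function is a distance-to-goal placeholder standing in for a learned 3D-aware verifier. It only shows the general pattern of scoring several candidate actions and executing the highest-scoring one.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical action type: a 3D end-effector displacement (assumption).
Action = Tuple[float, float, float]


def select_action(
    candidates: Sequence[Action],
    verifier: Callable[[Action], float],
) -> Tuple[Action, float]:
    """Score every candidate with the verifier and return the best one."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    scores = [verifier(action) for action in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def toy_verifier(goal: Action) -> Callable[[Action], float]:
    """Placeholder scorer (assumption): higher score means closer to the goal.

    A real verifier would reason over scene geometry and task progress;
    this stand-in uses only negative Euclidean distance.
    """
    def score(action: Action) -> float:
        return -math.dist(action, goal)
    return score


# Candidates as they might be sampled from an underlying VLA policy (made-up values).
candidates = [(0.10, 0.00, 0.05), (0.02, 0.01, 0.00), (0.30, -0.20, 0.10)]
best_action, best_score = select_action(candidates, toy_verifier((0.0, 0.0, 0.0)))
```

Because the selection step only needs a list of candidates and a scoring callable, it stays independent of the policy that proposed them, which mirrors the abstract's claim of compatibility with existing VLA policies.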

Related papers