Fugu-MT Paper Translation (Abstract): RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

Paper overview: RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation

arxiv url: http://arxiv.org/abs/2602.10980v1
Date: Wed, 11 Feb 2026 16:08:30 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:41.370022
Title: RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation
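Title (reference translation): RADAR: Benchmarking the Generalization of Vision-Language-Action Models via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation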
Authors: Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, Tianshui Chen, Keze Wang, Liang Lin, Guangrun Wang
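Abstract summary: This paper introduces RADAR, a benchmark that systematically evaluates VLA generalization under realistic conditions. Using RADAR, the authors audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence.
Reference score (attention score computed independently by this site): 76.22852262683746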
License: http://creativecommons.org/licenses/by/4.0/
Abstract: VLA models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, instead relying on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with the expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary benchmark for reliable and generalizable real-world evaluation of VLA models.
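Abstract (reference translation): VLA models have made remarkable progress in embodied intelligence, but their evaluation remains largely limited to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, in which strong benchmark performance masks poor generalization across diverse physical environments. We identify three systemic shortcomings in current benchmarking practice that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking important factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial-physical intelligence, reducing evaluation to manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, relying instead on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark for systematically evaluating VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. Using RADAR, we audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Under sensor noise, the expected 3D IoU dropped from 0.261 to 0.068. Moreover, the models show limited spatial reasoning ability. These results position RADAR as a necessary benchmark for reliable and generalizable real-world evaluation of VLA models.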
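
The abstract reports results as 3D IoU but does not define the metric. As a rough illustration only, the sketch below computes intersection-over-union for two axis-aligned 3D boxes and the relative size of the reported drop. The axis-aligned assumption, the `Box3D` class, and the function names are all hypothetical and are not taken from the paper; RADAR's actual metric may use oriented boxes, point clouds, or another formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box (hypothetical helper, not from the paper)."""
    min_corner: tuple[float, float, float]
    max_corner: tuple[float, float, float]

    def volume(self) -> float:
        # Product of the side lengths; degenerate sides contribute zero.
        vol = 1.0
        for lo, hi in zip(self.min_corner, self.max_corner):
            vol *= max(0.0, hi - lo)
        return vol


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    intersection = 1.0
    for a_lo, a_hi, b_lo, b_hi in zip(
        a.min_corner, a.max_corner, b.min_corner, b.max_corner
    ):
        overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
        if overlap <= 0.0:
            return 0.0  # boxes do not overlap along this axis
        intersection *= overlap
    union = a.volume() + b.volume() - intersection
    return intersection / union if union > 0.0 else 0.0


def relative_drop(before: float, after: float) -> float:
    """Fraction of the original score that was lost."""
    return (before - after) / before


if __name__ == "__main__":
    target = Box3D((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
    placed = Box3D((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))  # shifted by half a side
    print(iou_3d(target, placed))        # about 0.333
    print(relative_drop(0.261, 0.068))   # about 0.739
```

Read this way, the reported decline from 0.261 to 0.068 under sensor noise is a loss of roughly 74% of the original score.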


Related papers