Fugu-MT paper translation (abstract): Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

Paper overview: Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

arxiv url: http://arxiv.org/abs/2603.13616v1
Date: Fri, 13 Mar 2026 21:47:48 GMT
Status: Translation complete
Last updated in system: 2026-03-28 17:42:31.5953
Title: Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison
Authors: David Snyder, Apurva Badithela, Nikolai Matni, George Pappas, Anirudha Majumdar, Masha Itkina, Haruki Nishimura
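Abstract summary: Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice.
Reference score (attention score computed independently by the site): 17.732982117200425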
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
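
The early-stopping idea in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it is a generic testing-by-betting construction for bounded scores, swapped in here for illustration only. It assumes paired rollouts of two policies with per-rollout scores in [0, 1] (for example, partial-credit task progress), tests the one-sided null "policy A is not better than policy B on average", and uses a fixed bet size. The function name `sequential_betting_test`, the parameter `bet`, and the sample scores are all hypothetical.

```python
from typing import Iterable, Tuple


def sequential_betting_test(
    diffs: Iterable[float],
    alpha: float = 0.05,
    bet: float = 0.5,
) -> Tuple[bool, int, float]:
    """One-sided anytime-valid test of H0: E[d] <= 0 for d in [-1, 1].

    Illustrative sketch only (not the paper's method). The wealth process
    W_t = prod(1 + bet * d_i) is a nonnegative supermartingale under H0,
    so by Ville's inequality P(W_t >= 1/alpha for some t) <= alpha.
    Returns (rejected, rollouts_used, final_wealth).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not 0.0 <= bet < 1.0:
        raise ValueError("bet must lie in [0, 1) to keep wealth positive")

    wealth = 1.0
    n = 0
    for d in diffs:
        if not -1.0 <= d <= 1.0:
            raise ValueError("score differences must lie in [-1, 1]")
        n += 1
        wealth *= 1.0 + bet * d  # grows when A outperforms B on this rollout
        if wealth >= 1.0 / alpha:
            # Enough evidence has accumulated: stop early and reject H0.
            return True, n, wealth
    return False, n, wealth


if __name__ == "__main__":
    # Hypothetical partial-credit task-progress scores, one pair per rollout.
    progress_a = [1.0, 0.75, 1.0, 0.5, 1.0, 1.0, 0.75, 1.0, 1.0, 0.75]
    progress_b = [0.25, 0.5, 0.0, 0.5, 0.25, 0.5, 0.0, 0.25, 0.5, 0.0]
    diffs = [a - b for a, b in zip(progress_a, progress_b)]
    print(sequential_betting_test(diffs, alpha=0.05))
```

Because the rejection threshold holds at every step simultaneously, the evaluator may check the wealth after each rollout and stop as soon as it crosses 1/alpha without inflating the error rate. A fixed bet size keeps the sketch short; adaptive betting strategies generally stop sooner.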
