Fugu-MT Paper Translation (Abstract): Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods

Paper abstract: Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods

arxiv url: http://arxiv.org/abs/2601.18723v1
Date: Mon, 26 Jan 2026 17:47:42 GMT
Status: Translation complete
System update date: 2026-03-23 08:17:40.955228
Title: Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods
Authors: Mengyuan Liu, Juyi Sheng, Peiming Li, Ziyi Wang, Tianming Xu, Tiantian Xu, Hong Liu
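Abstract summary: The authors propose a solution that combines the Eval-Actions benchmark and the AutoEval architecture. The dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT). AutoEval achieves Spearman's Rank Correlation Coefficients (SRCC) of 0.81 and 0.84 under the EG and RG protocols, respectively.
Reference score (independently computed attention score): 30.612032540735402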
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Driven by the rapid evolution of Vision-Action and Vision-Language-Action models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors. Current paradigms rely on binary success rates, failing to address the critical dimensions of trust: Source Authenticity (i.e., distinguishing genuine policy behaviors from human teleoperation) and Execution Quality (e.g., smoothness and safety). To bridge these gaps, we propose a solution that combines the Eval-Actions benchmark and the AutoEval architecture. First, we construct the Eval-Actions benchmark to support trustworthiness analysis. Distinct from existing datasets restricted to successful human demonstrations, Eval-Actions integrates VA and VLA policy execution trajectories alongside human teleoperation data, explicitly including failure scenarios. This dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT). Building on this, we propose the AutoEval architecture: AutoEval leverages Spatio-Temporal Aggregation for semantic assessment, augmented by an auxiliary Kinematic Calibration Signal to refine motion smoothness; AutoEval Plus (AutoEval-P) incorporates the Group Relative Policy Optimization (GRPO) paradigm to enhance logical reasoning capabilities. Experiments show AutoEval achieves Spearman's Rank Correlation Coefficients (SRCC) of 0.81 and 0.84 under the EG and RG protocols, respectively. Crucially, the framework possesses robust source discrimination capabilities, distinguishing between policy-generated and teleoperated videos with 99.6% accuracy, thereby establishing a rigorous standard for trustworthy robotic evaluation. Our project and code are available at https://term-bench.github.io/.
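The abstract reports agreement between AutoEval and human judgments as Spearman's Rank Correlation Coefficient (SRCC). As a rough illustration of what that metric measures, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names and the sample scores are hypothetical, chosen only to show how SRCC is obtained by ranking both score lists and taking the Pearson correlation of the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: expert grades and model-predicted scores for five
# trajectories (illustrative values only, not taken from the paper).
expert_scores = [9.0, 7.5, 3.0, 6.0, 1.5]
model_scores = [8.2, 7.9, 2.5, 4.0, 3.1]
srcc = spearman_srcc(expert_scores, model_scores)
print(f"SRCC = {srcc:.2f}")
```

Because SRCC depends only on rank order, a model can score well without matching the experts' absolute scale, which is why it suits graded quality assessments of this kind.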
