Fugu-MT Paper Translation (Abstract): EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Paper overview: EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

arxiv url: http://arxiv.org/abs/2606.18239v1
Date: Tue, 16 Jun 2026 17:58:22 GMT
Status: Translation complete
Last updated in system: 2026-06-17 17:15:32.594764
Title: EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
Title (reference translation): EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
Authors: Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen
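Abstract summary: This paper introduces EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions.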
Reference score (independently computed attention score): 92.63011025295123
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that lie behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
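Abstract (reference translation): EBench is a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench consists of 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. The authors evaluate state-of-the-art generalist manipulation models such as $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and show that models with similar success rates have markedly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train-test retention, while InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA is strong on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives and identifies the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models behind an overall score. The authors hope the benchmark provides a broad set of diagnostic signals to guide iteration on generalist manipulation models.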
