Fugu-MT Paper Translation (Abstract): LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Paper overview: LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

arxiv url: http://arxiv.org/abs/2510.03827v1
Date: Sat, 04 Oct 2025 14:56:40 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.280611
Title: LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Title (reference translation): LIBERO-PRO: Toward Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
Authors: Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun
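Abstract summary: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models. This paper introduces LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations. Experiments show that existing models achieve over 90% accuracy under the standard LIBERO evaluation but drop to 0.0% under the generalized setting.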
Reference score (independently computed attention score): 33.562814882942
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or even messy tokens. These findings expose the severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.
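Abstract (reference translation): LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models, but its current training and evaluation settings are problematic, often inflating performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experiments show that although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance drops to 0.0% under the generalized setting. This discrepancy reveals that the models rely on rote memorization of action sequences and environment layouts from the training set rather than on genuine task understanding or environmental perception. For example, models continue to execute grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when they are given corrupted instructions or meaningless tokens. These results expose serious flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at https://github.com/Zxy-MLlab/LIBERO-PRO.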


Related papers