Fugu-MT Paper Translation (Summary): ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Paper summary: ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

arxiv url: http://arxiv.org/abs/2606.17639v2
Date: Wed, 17 Jun 2026 07:18:46 GMT
Status: Translation complete
Last updated in system: 2026-06-18 13:57:35.223221
Title: ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI
Title (reference translation): ERQA-Plus: A diagnostic benchmark for reasoning in embodied AI
Authors: Hong Yang, Basura Fernando
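Abstract summary: ERQA-Plus is a diagnostic benchmark for reasoning in embodied AI. It contains 1,766 question-answer instances grounded in 711 robot-centric images.
Reference score (attention level computed independently by the site): 14.957780321740394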
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.
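Abstract (reference translation): Generalist embodied agents need more than object recognition: they have to reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. However, existing visual and embodied question answering benchmarks offer only limited control over the reasoning dependencies under test, so it is difficult to tell grounded embodied reasoning apart from shortcut-driven visual or linguistic pattern matching. We propose ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images, organized under a structured taxonomy that spans perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is built with a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. The strongest model, Qwen3-VL-32B, reaches 83.4% overall accuracy and an SBERT score of 61.4, but category-level results show persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework that measures not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.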
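
The abstract contrasts a single overall accuracy with category-level results. A minimal sketch of that kind of breakdown follows, assuming each graded answer is a record with a `category` label and a boolean `correct` flag. Both field names, the helper `per_category_accuracy`, and the toy records are hypothetical: they are not the ERQA-Plus schema, the authors' evaluation code, or results from the paper.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return (overall_accuracy, {category: accuracy}) for graded records.

    Each record is assumed to be a dict with a "category" string and a
    boolean "correct" flag. These field names are illustrative only.
    """
    if not records:
        raise ValueError("records must not be empty")

    totals = defaultdict(int)  # questions seen per category
    hits = defaultdict(int)    # correct answers per category
    for rec in records:
        totals[rec["category"]] += 1
        hits[rec["category"]] += int(rec["correct"])

    by_category = {cat: hits[cat] / totals[cat] for cat in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, by_category


# Made-up toy records, only to show how the two views can diverge.
records = [
    {"category": "perceptual", "correct": True},
    {"category": "perceptual", "correct": True},
    {"category": "perceptual", "correct": True},
    {"category": "action-centric", "correct": True},
    {"category": "action-centric", "correct": False},
    {"category": "navigation-environmental", "correct": False},
]

overall, by_category = per_category_accuracy(records)
print(f"overall: {overall:.3f}")
for category, accuracy in sorted(by_category.items()):
    print(f"{category}: {accuracy:.3f}")
```

In this toy data the overall figure is dominated by the largest category, which is the effect a category-level report is meant to expose.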
