Fugu-MT Paper Translation (Abstract): Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

Paper summary: Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents

arxiv url: http://arxiv.org/abs/2603.02239v1
Date: Mon, 16 Feb 2026 12:38:08 GMT
Status: Translation complete
System update date: 2026-03-09 01:20:08.105971
Title: Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents
Title (reference translation): Engineering Reasoning and Instruction (ERI) Benchmark: A Large-Scale Taxonomy-Driven Dataset for Foundation Models and Agents
Authors: MZ Naser, Ahmad Bani Awwad, Zoie McCreery, Radwa Eissa, Ahmad Naser, Gianluca Cusatis, Andrew Metcalf, Kapil Madathil, Jamal Abdalla, Venkatesh Kodur, Mohammad Reza Saeb
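Abstract summary: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. The dataset spans nine engineering fields (civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, crossed with seven intent types (definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional). ERI is released with taxonomy specifications, validation scripts, and an evaluation harness.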
Reference score (attention score computed independently by this site): 1.629288881045104
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. This dataset spans nine engineering fields (namely: civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, and is crossed with seven intent types (i.e., definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), yielding 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. We examined ERI via seven LLMs and report a statistically significant three-tier performance structure, with frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieving mean scores above 4.30 on a five-point scale, while mid-tier and smaller models exhibited progressively higher failure rates and steeper performance degradation on graduate-level questions. To address circularity concerns inherent in LLM benchmarks, we developed a convergent validation protocol that leverages cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to empirically bound hallucination risk to 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness to enable reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.
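Abstract (reference translation): The Engineering Reasoning and Instruction (ERI) benchmark is a taxonomy-driven instruction dataset designed to train and evaluate engineering-capable large language models (LLMs) and agents. The dataset spans nine engineering fields (civil, mechanical, electrical, chemical, environmental, aerospace, materials, fire, and industrial engineering) and 55 subdomains, is crossed with seven intent types (definition, explanation, calculation, comparison, design/synthesis, troubleshooting, and code-related) and three difficulty tiers (undergraduate, graduate, and professional), and yields 57,750 records with field/subdomain/type/difficulty metadata and solution formatting. ERI was examined with seven LLMs; the frontier models (GPT-5, Claude Sonnet 4, DeepSeek V3.1) achieved mean scores above 4.30 on a five-point scale. To address the circularity concerns inherent in LLM benchmarks, a convergent validation protocol was developed that uses cross-provider independence, multi-judge averaging, and frontier-model agreement analysis to bound hallucination risk at 1.7%. ERI is released with taxonomy specifications, validation scripts, and an evaluation harness that enables reproducible comparisons and regression testing for instruction tuning, routing, retrieval-augmented evaluation, and agentic tool-use workflows in engineering settings.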
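
The record count in the abstract can be sanity-checked against the taxonomy dimensions. The sketch below crosses 55 subdomains with the seven intent types and three difficulty tiers, then divides the 57,750 records by the number of resulting cells. The abstract does not state a per-cell count, so the even split is an assumption made here for illustration, and the variable and function names are hypothetical rather than taken from the ERI release.

```python
from itertools import product

# Figures taken from the abstract.
N_SUBDOMAINS = 55
INTENT_TYPES = [
    "definition", "explanation", "calculation", "comparison",
    "design/synthesis", "troubleshooting", "code-related",
]
DIFFICULTY_TIERS = ["undergraduate", "graduate", "professional"]
TOTAL_RECORDS = 57_750


def taxonomy_cells(n_subdomains, intents, tiers):
    """Enumerate every (subdomain index, intent, tier) combination.

    Subdomains are represented by integer indices because the abstract
    does not list their names.
    """
    return list(product(range(n_subdomains), intents, tiers))


cells = taxonomy_cells(N_SUBDOMAINS, INTENT_TYPES, DIFFICULTY_TIERS)
n_cells = len(cells)

# Assumption: records are spread evenly over the cells. The abstract only
# gives the total, so this is an inference, not a documented property.
records_per_cell, remainder = divmod(TOTAL_RECORDS, n_cells)

print(n_cells, records_per_cell, remainder)  # prints: 1155 50 0
```

The total divides evenly into 1,155 cells of 50 records each, which is consistent with a fully crossed design, although the actual allocation should be confirmed against the released taxonomy specifications.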

Related papers