Fugu-MT Paper Translation (Summary): EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Paper summary: EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

arxiv url: http://arxiv.org/abs/2509.17677v1
Date: Mon, 22 Sep 2025 12:20:27 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.368991
Title: EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
Title (reference translation): EngiBench: A Large Language Model Evaluation Benchmark for Engineering Problem Solving
Authors: Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao
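Abstract summary: This paper introduces EngiBench, a hierarchical benchmark that evaluates large language models (LLMs) on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. Models struggle more as tasks get harder and perform worse when problems are slightly changed.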
Reference score (independently calculated attention level): 37.708900742664184
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
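Abstract (reference translation): Large language models (LLMs) show strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation and must handle uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. This paper introduces EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers a variety of engineering subfields. To facilitate a deeper understanding of model performance, each problem is systematically rewritten into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), so that the model's robustness, domain-specific knowledge, and mathematical reasoning ability can be evaluated separately. Models struggle more as tasks get harder and perform worse when problems are slightly changed. These results indicate that current LLMs still lack the high-level reasoning needed for real-world engineering and highlight the need for future models with deeper and more reliable problem-solving capabilities. The source code and data are available at https://github.com/EngiBench/EngiBench.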

Related papers