Fugu-MT Paper Translation (Summary): AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Paper summary: AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

arxiv url: http://arxiv.org/abs/2509.18776v1
Date: Tue, 23 Sep 2025 08:09:58 GMT
Status: translation complete
Last updated in system: 2025-09-24 20:41:27.768026
Title: AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field
Title (reference translation): AECBench: A hierarchical benchmark for evaluating the knowledge of large language models in the AEC field
Authors: Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao
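Abstract summary: Large language models (LLMs) are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. This paper establishes AECBench, a benchmark that quantifies the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework.
Reference score (independently computed attention score): 12.465017512854475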
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specialized documents generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
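Abstract (reference translation): Large language models (LLMs), as a new information technology, are increasingly being adopted in the Architecture, Engineering, and Construction (AEC) field. They have shown potential to streamline processes across the entire building lifecycle. However, the robustness and reliability of LLMs in such a specialized, safety-critical domain have not yet been evaluated. To address this, the paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level, cognition-oriented evaluation framework comprising Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks are derived from real AEC practice and range from code retrieval to the generation of specialized documents. A 4,800-question dataset covering diverse formats, including open-ended questions, was then created primarily by engineers and validated through a two-round expert review. In addition, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent method for evaluating complex, long-form responses using expert-derived rubrics. Evaluating nine LLMs revealed a clear decline in performance across the five cognitive levels. Although the models were proficient at foundational tasks at the Knowledge Memorization and Understanding levels, they showed significant deficits, particularly in interpreting knowledge from tables in building codes, carrying out complex reasoning and calculation, and generating domain-specific documents. The study is therefore expected to lay the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practice.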


Related papers