Fugu-MT Paper Translation (Abstract): MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Paper overview: MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

arxiv url: http://arxiv.org/abs/2510.12171v1
Date: Tue, 14 Oct 2025 05:59:40 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.19996
Title: MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
Title (reference translation): MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
Authors: Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang
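Abstract summary: MatSciBench is a comprehensive college-level benchmark comprising 1,340 problems. It features a structured, fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 sub-fields. Evaluations of leading models show that even the highest-performing model, Gemini-2.5-Pro, achieves under 80% accuracy on college-level materials science questions.
Reference score (independently calculated attention score): 28.11660982198711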
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in scientific reasoning, yet their reasoning capabilities in materials science remain underexplored. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1,340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 sub-fields, and includes a three-tier difficulty classification based on the reasoning length required to solve each question. MatSciBench provides detailed reference solutions enabling precise error analysis and incorporates multimodal reasoning through visual contexts in numerous questions. Evaluations of leading models reveal that even the highest-performing model, Gemini-2.5-Pro, achieves under 80% accuracy on college-level materials science questions, highlighting the complexity of MatSciBench. Our systematic analysis of different reasoning strategies (basic chain-of-thought, tool augmentation, and self-correction) demonstrates that no single method consistently excels across all scenarios. We further analyze performance by difficulty level, examine trade-offs between efficiency and accuracy, highlight the challenges inherent in multimodal reasoning tasks, analyze failure modes across LLMs and reasoning methods, and evaluate the influence of retrieval-augmented generation. MatSciBench thus establishes a comprehensive and solid benchmark for assessing and driving improvements in the scientific reasoning capabilities of LLMs within the materials science domain.
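Abstract (reference translation): Large language models (LLMs) have demonstrated remarkable abilities in scientific reasoning, but their reasoning capabilities in materials science remain underexplored. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark of 1,340 problems spanning the essential subdisciplines of materials science. MatSciBench features a structured, fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 sub-fields. MatSciBench provides detailed reference solutions that enable precise error analysis, and incorporates multimodal reasoning through visual contexts in many questions. Even the best-performing model, Gemini-2.5-Pro, achieves under 80% accuracy on college-level materials science questions, underscoring the complexity of MatSciBench. Our systematic analysis of different reasoning strategies (basic chain-of-thought, tool augmentation, and self-correction) shows that no single method consistently excels across all scenarios. We further analyze performance by difficulty level, examine trade-offs between efficiency and accuracy, highlight the challenges inherent in multimodal reasoning tasks, analyze failure modes across LLMs and reasoning methods, and evaluate the influence of retrieval-augmented generation. MatSciBench thus establishes a comprehensive and solid benchmark for assessing and driving improvements in the scientific reasoning capabilities of LLMs within the materials science domain.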

Related papers