Fugu-MT Paper Translation (Abstract): MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Paper overview: MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

arxiv url: http://arxiv.org/abs/2603.11414v1
Date: Thu, 12 Mar 2026 01:04:32 GMT
Status: Translation complete
System-internal update date: 2026-03-13 14:46:25.799921
Title: MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models
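Title (reference translation): MaterialFigBench: A benchmark dataset for evaluating the college-level materials science problem-solving abilities of multimodal large language models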
Authors: Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato
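Abstract summary: This paper presents a benchmark dataset for evaluating the ability of multimodal large language models (LLMs) to solve university-level materials science problems. MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for obtaining the correct solution.
Reference score (independently computed attention measure): 9.321900260858351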
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
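Abstract (reference translation): We propose MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that rely mainly on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving the correct answer. The dataset comprises 137 free-response problems adapted from standard materials science textbooks and covers a wide range of topics, including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and the electronic properties of materials. To handle the unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via the OpenAI API, and analyze their performance across problem categories and model versions. The results show that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, and identifies the problem types where performance has improved. The benchmark provides a systematic, domain-specific foundation for advancing multimodal reasoning in materials science and for guiding the development of future LLMs with stronger figure-based understanding.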
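
The abstract mentions expert-defined answer ranges for values read off figures. As a rough illustration only, the sketch below shows one way such range-based grading could work. The names `AnswerRange`, `is_correct`, and `accuracy` are hypothetical, the inclusive-bounds rule is an assumption, and none of this is taken from the paper's actual evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AnswerRange:
    """Hypothetical container for an expert-defined acceptable range."""
    low: float
    high: float


def is_correct(predicted: Optional[float], accepted: AnswerRange) -> bool:
    """Return True when the numeric answer lies inside the range (bounds inclusive, by assumption)."""
    if predicted is None:  # the model produced no parseable number
        return False
    return accepted.low <= predicted <= accepted.high


def accuracy(predictions: Sequence[Optional[float]],
             ranges: Sequence[AnswerRange]) -> float:
    """Fraction of problems whose prediction falls within its accepted range."""
    if len(predictions) != len(ranges):
        raise ValueError("predictions and ranges must have the same length")
    if not ranges:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, ranges))
    return hits / len(ranges)


if __name__ == "__main__":
    # Made-up illustrative values, not taken from the dataset.
    ranges = [AnswerRange(0.74, 0.80), AnswerRange(1.0, 2.0)]
    print(accuracy([0.77, None], ranges))
```

Grading against a range rather than a single value means a model is not penalized for the reading error inherent in extracting numbers from a plot, which is the ambiguity the abstract describes.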


Related papers