Fugu-MT Paper Translation (Abstract): AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

Paper overview: AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

arxiv url: http://arxiv.org/abs/2605.05573v1
Date: Thu, 07 May 2026 01:36:27 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.478594
Title: AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
Title (reference translation): AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
Authors: Claire Chen, Jiabao Sean Xiao, Shuze Daniel Liu, Facundo Perez Paolino, Luke Handley, Theophile Jegou du Laz, Ricky Nilsson, Alice Zou, Matthew Graham, Ashish Mahabal
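Abstract summary: AstroAlertBench is a comprehensive benchmark designed to evaluate large language models (LLMs) for astronomical event review. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. Our results reveal that high accuracy does not always align with model "honesty," defined as the ability to self-evaluate its reasoning.
Reference score (independently computed attention metric): 6.546448267229169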
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs that support visual input. Our results reveal that high accuracy does not always align with model "honesty," defined as the ability to self-evaluate its reasoning, which affects its reliability as a real-world assistant. We further initialize a human-in-the-loop evaluation protocol as a precursor to future community-scale participation. Together, AstroAlertBench provides a framework for developing calibrated and interpretable astronomical assistants.
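Abstract (reference translation): Modern astronomical observatories produce enormous amounts of multimodal data, which creates a serious bottleneck for review by human experts. Multimodal large language models (LLMs) show promise in interpreting complex visual and textual inputs, but their ability to carry out specialized scientific classification while giving interpretable reasoning remains understudied. AstroAlertBench is a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical event review along a three-stage logical chain: metadata grounding, scientific reasoning, and hierarchical classification over five categories. We use a pilot sample of 1,500 real-world alerts from the Zwicky Transient Facility (ZTF), a wide-field survey that scans the northern sky to detect transient astronomical events. On this dataset, we benchmark 13 frontier closed-source and open-weight LLMs that support visual input. Our results show that high accuracy does not always align with model "honesty," defined as the ability to self-evaluate its reasoning, which affects its reliability as a real-world assistant. We also initialize a human-in-the-loop evaluation protocol as a precursor to future community-scale participation. Together, AstroAlertBench provides a framework for developing calibrated and interpretable astronomical assistants.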

