Fugu-MT paper translation (abstract): A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Paper overview: A systematic evaluation of vision-language models for observational astronomical reasoning tasks

arxiv url: http://arxiv.org/abs/2604.24589v1
Date: Mon, 27 Apr 2026 15:11:31 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:08.114157
Title: A systematic evaluation of vision-language models for observational astronomical reasoning tasks
Title (reference translation): Systematic evaluation of vision-language models for observational astronomical reasoning tasks
Authors: Wenke Ren, Hengxiao Guo, Wenwen Zuo, Xiaoman Zhang,
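Abstract summary: Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations has not been demonstrated. AstroVLBench is a comprehensive benchmark of more than 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy.
Reference score (independently computed attention score): 3.608879177162859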
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.
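Abstract (reference translation): Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, but their reliability on real astronomical observations across diverse modalities has not been verified. AstroVLBench is a comprehensive benchmark of more than 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models shows that performance depends strongly on modality: one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, but task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations show that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts that describe what to look for improve accuracy by sharpening model focus, but physical prompts that explain why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables rather than rendered plots yields an improvement of up to 13 percentage points. Reasoning quality analysis shows that, without explicit physical grounding, models can reach correct predictions from phenomenologically plausible cues while giving physically imprecise justifications, which establishes that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.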
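The abstract reports that giving models one-dimensional measurements as numerical tables, rather than as rendered plots, improved accuracy. The sketch below illustrates what such a table representation might look like for a light curve. It is not the authors' code: the function names `light_curve_to_table` and `build_prompt`, the column labels, and the prompt layout are all assumptions made for illustration, and the abstract does not specify the format used in AstroVLBench.

```python
def light_curve_to_table(times, mags, errs, precision=3):
    """Render a 1-D light curve as a plain-text table (hypothetical format)."""
    if not (len(times) == len(mags) == len(errs)):
        raise ValueError("times, mags, and errs must have the same length")
    # Column names are illustrative; the benchmark's actual labels are unknown.
    header = "time_days | magnitude | mag_error"
    rows = [
        f"{t:.{precision}f} | {m:.{precision}f} | {e:.{precision}f}"
        for t, m, e in zip(times, mags, errs)
    ]
    return "\n".join([header, *rows])


def build_prompt(table, question):
    """Combine a question and the table into one text prompt (assumed layout)."""
    return f"{question}\n\nMeasurements:\n{table}"


if __name__ == "__main__":
    # Toy values, not real observations.
    table = light_curve_to_table([0.0, 1.5], [18.25, 18.4], [0.02, 0.03], precision=2)
    print(build_prompt(table, "Classify the variability type of this source."))
```

The resulting string can be sent to a model as text in place of an image of the plot, so the model reads the measured values directly instead of estimating them from pixels.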

Related papers