Fugu-MT Paper Translation (Summary): Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)

Paper summary: Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)

arxiv url: http://arxiv.org/abs/2510.05016v2
Date: Tue, 07 Oct 2025 15:34:59 GMT
Status: Translation complete
Last updated in system: 2025-10-08 13:19:51.503901
Title: Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)
Authors: Lucas Carrit Delgado Pinheiro, Ziru Chen, Bruno Caixeta Piazza, Ness Shroff, Yingbin Liang, Yuan-Sen Ting, Huan Sun
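Abstract summary: We benchmark five large language models (LLMs) on the International Olympiad on Astronomy and Astrophysics (IOAA) exams. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 rank in the top two among ~200-300 participants in all four IOAA theory exams evaluated. On the data analysis exams, GPT-5 achieves an 88.5% average score, ranking in the top 10 among participants in the four most recent IOAAs.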
Reference score (independently computed attention score): 43.53870250026015
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While task-specific demonstrations show early success in applying large language models (LLMs) to automate some astronomical research tasks, they only provide incomplete views of all necessary capabilities in solving astronomy problems, calling for more thorough understanding of LLMs' strengths and limitations. So far, existing benchmarks and evaluations focus on simple question-answering that primarily tests astronomical knowledge and fails to evaluate the complex reasoning required for real-world research in the discipline. Here, we address this gap by systematically benchmarking five state-of-the-art LLMs on the International Olympiad on Astronomy and Astrophysics (IOAA) exams, which are designed to examine deep conceptual understanding, multi-step derivations, and multimodal analysis. With average scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing models) not only achieve gold medal level performance but also rank in the top two among ~200-300 participants in all four IOAA theory exams evaluated (2022-2025). In comparison, results on the data analysis exams show more divergence. GPT-5 still excels in the exams with an 88.5% average score, ranking top 10 among the participants in the four most recent IOAAs, while other models' performances drop to 48-76%. Furthermore, our in-depth error analysis underscores conceptual reasoning, geometric reasoning, and spatial visualization (52-79% accuracy) as consistent weaknesses among all LLMs. Hence, although LLMs approach peak human performance in theory exams, critical gaps must be addressed before they can serve as autonomous research agents in astronomy.

Related papers