AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model
- URL: http://arxiv.org/abs/2505.17592v1
- Date: Fri, 23 May 2025 07:58:50 GMT
- Title: AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model
- Authors: Tijmen de Haan, Yuan-Sen Ting, Tirthankar Ghosal, Tuan Dung Nguyen, Alberto Accomazzi, Emily Herron, Vanessa Lama, Rui Pan, Azton Wells, Nesar Ramachandra,
- Abstract summary: General-purpose large language models often struggle with specialized domain knowledge. This study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose large language models, despite their broad capabilities, often struggle with specialized domain knowledge, a limitation particularly pronounced in more accessible, lower-parameter versions. This gap hinders their deployment as effective agents in demanding fields such as astronomy. Building on our prior work with AstroSage-8B, this study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation. Developed from the Llama-3.1-70B foundation, AstroSage-70B underwent extensive continued pre-training on a vast corpus of astronomical literature, followed by supervised fine-tuning (SFT) and model merging. Beyond its 70-billion-parameter scale, this model incorporates refined datasets, judiciously chosen learning hyperparameters, and improved training procedures, achieving state-of-the-art performance on complex astronomical tasks. Notably, we integrated reasoning chains into the SFT dataset, enabling AstroSage-70B to either answer the user query immediately or first emit a human-readable thought process. Evaluated on the AstroMLab-1 benchmark, comprising 4,425 questions from literature withheld during training, AstroSage-70B surpasses all other tested open-weight and proprietary models, including leading systems such as o3, Gemini-2.5-Pro, Claude-3.7-Sonnet, DeepSeek-R1, and Qwen-3-235B, even those with API costs two orders of magnitude higher. This work demonstrates that domain specialization, when applied to large-scale models, can enable them to outperform generalist counterparts in specialized knowledge areas like astronomy, thereby advancing the frontier of AI capabilities in the field.
Related papers
- STAR: A Benchmark for Astronomical Star Fields Super-Resolution [51.79340280382437]
We propose STAR, a large-scale astronomical super-resolution (SR) dataset containing 54,738 flux-consistent star field image pairs. We also propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry.
arXiv Detail & Related papers (2025-07-22T09:28:28Z) - AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy [59.32718342798908]
We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. We present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.
arXiv Detail & Related papers (2025-05-26T21:49:18Z) - ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study [26.39743358097732]
ORBIT is a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76%. The resulting model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1,000 astronomy-specific questions. (A generic sketch of this style of relevance filtering appears after this list.)
arXiv Detail & Related papers (2024-12-19T01:35:47Z) - AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy [4.729846733874557]
This study aims to quantitatively assess specialized LLMs in astronomy.
We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms its base model.
Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements.
arXiv Detail & Related papers (2024-09-29T16:02:22Z) - AstroMLab 1: Who Wins Astronomy Jeopardy!? [4.162245706139047]
This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics.
Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy.
Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models.
arXiv Detail & Related papers (2024-07-15T19:28:14Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook [95.32949323258251]
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world applications.
Recent advances in large language and other foundation models have spurred increased use in time series and spatio-temporal data mining.
arXiv Detail & Related papers (2023-10-16T09:06:00Z) - AstroLLaMA: Towards Specialized Foundation Models in Astronomy [1.1694367694169385]
We introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv.
Our model generates more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models.
Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
arXiv Detail & Related papers (2023-09-12T11:02:27Z) - Selected Trends in Artificial Intelligence for Space Applications [69.3474006357492]
This chapter focuses on differentiable intelligence and on-board machine learning.
We discuss a few selected projects originating from the European Space Agency's (ESA) Advanced Concepts Team (ACT).
arXiv Detail & Related papers (2022-12-10T07:49:50Z) - Applications of AI in Astronomy [0.0]
We provide an overview of the use of Machine Learning (ML) and other AI methods in astronomy, astrophysics, and cosmology.
Over the past decade we have seen an exponential growth of the astronomical literature involving a variety of ML/AI applications.
As data complexity continues to increase, we anticipate further advances leading toward collaborative human-AI discovery.
arXiv Detail & Related papers (2022-12-03T00:38:59Z) - Semi-Supervised Domain Adaptation for Cross-Survey Galaxy Morphology Classification and Anomaly Detection [57.85347204640585]
We develop DeepAstroUDA, a Universal Domain Adaptation method.
It can be applied to datasets with different types of class overlap.
For the first time, we demonstrate the successful use of domain adaptation on two very different observational datasets.
arXiv Detail & Related papers (2022-11-01T18:07:21Z) - Satellite Image Time Series Analysis for Big Earth Observation Data [50.591267188664666]
This paper describes sits, an open-source R package for satellite image time series analysis using machine learning.
We show that this approach produces high accuracy for land use and land cover maps through a case study in the Cerrado biome.
arXiv Detail & Related papers (2022-04-24T15:23:25Z)