AstroMLab 1: Who Wins Astronomy Jeopardy!?
- URL: http://arxiv.org/abs/2407.11194v1
- Date: Mon, 15 Jul 2024 19:28:14 GMT
- Title: AstroMLab 1: Who Wins Astronomy Jeopardy!?
- Authors: Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi, et al.
- Abstract summary: This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics.
Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy.
Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models.
- Score: 4.162245706139047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, which is crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observe a consistent reduction, every 3 to 12 months, in the cost required to achieve a similar score on this benchmark. Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more on questions related to exoplanets, stellar astrophysics, and instrumentation. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training-data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
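The calibration result quoted above (confidence-correctness correlations above 0.9, with slight underconfidence) can be checked with a standard reliability analysis. Below is a minimal Python sketch, assuming per-question confidences in [0, 1] and binary correctness flags; the variable names and synthetic data are illustrative, not taken from the paper's pipeline.

```python
import numpy as np

def calibration_summary(confidences, correct, n_bins=10):
    """Bin answers by stated confidence and compare mean confidence with
    empirical accuracy in each bin (a standard reliability analysis)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    occupied = [b for b in range(n_bins) if (idx == b).any()]
    bin_conf = np.array([confidences[idx == b].mean() for b in occupied])
    bin_acc = np.array([correct[idx == b].mean() for b in occupied])
    # Correlation between per-bin confidence and accuracy; the paper reports
    # values above 0.9 for the top models. Positive (accuracy - confidence)
    # gaps indicate underconfidence, as observed for the best models.
    r = np.corrcoef(bin_conf, bin_acc)[0, 1]
    return r, bin_acc - bin_conf

# Toy data: 4,425 questions with slightly underconfident predictions
rng = np.random.default_rng(0)
conf = rng.uniform(0.3, 1.0, size=4425)
correct = rng.uniform(size=conf.size) < np.clip(conf + 0.05, 0.0, 1.0)
r, gaps = calibration_summary(conf, correct)
print(f"confidence-accuracy correlation: {r:.3f}")
```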
Related papers
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy [4.729846733874557]
This study aims to quantitatively assess specialized LLMs in astronomy.
We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model.
Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements.
arXiv Detail & Related papers (2024-09-29T16:02:22Z)
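Continual pretraining, in outline, means resuming the causal-language-modeling objective on domain text. The following is a minimal Hugging Face sketch on a toy corpus; the base model, data, and hyperparameters are placeholders, not the AstroLLaMA-2-70B recipe.

```python
# Minimal continual-pretraining sketch (not the AstroLLaMA recipe):
# resume causal-LM training on domain text with Hugging Face transformers.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; the paper continually pretrains a 70B LLaMA
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

corpus = ["Type Ia supernovae serve as standardizable candles ...",
          "The initial mass function describes the distribution ..."]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

ds = Dataset.from_dict({"text": corpus}).map(tokenize, batched=True,
                                             remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="astro-cpt", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```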
- Maven: A Multimodal Foundation Model for Supernova Science [40.20166238855543]
We present Maven, the first foundation model for supernova science.
We first pre-train our model to align photometry and spectroscopy from 0.5M synthetic supernovae.
We then fine-tune the model on 4,702 observed supernovae from the Zwicky Transient Facility.
arXiv Detail & Related papers (2024-08-29T18:00:05Z)
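Aligning photometry and spectroscopy embeddings of this kind is typically done with a CLIP-style contrastive objective. Here is a minimal PyTorch sketch under that assumption; the dimensions and loss details are illustrative, not necessarily Maven's actual architecture.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(phot_emb, spec_emb, temperature=0.07):
    """Symmetric InfoNCE loss: embeddings of the same supernova's photometry
    and spectroscopy should match; all other pairings act as negatives."""
    phot = F.normalize(phot_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = phot @ spec.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(len(phot))         # i-th photometry <-> i-th spectrum
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy batch: 8 supernovae, each with a 128-d photometry and spectroscopy embedding
phot = torch.randn(8, 128)
spec = torch.randn(8, 128)
print(contrastive_alignment_loss(phot, spec))
```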
- Real-time gravitational-wave inference for binary neutron stars using machine learning [71.29593576787549]
We present a machine learning framework that performs complete BNS inference in just one second without making any approximations.
Our approach enhances multi-messenger observations by providing (i) accurate localization even before the merger; (ii) improved localization precision by $\sim 30\%$ compared to approximate low-latency methods; and (iii) detailed information on luminosity distance, inclination, and masses.
arXiv Detail & Related papers (2024-07-12T18:00:02Z)
- At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models [0.0]
Vision-language multimodal models (VLMs) offer the possibility of zero-shot classification in astronomy.
We investigate two models, GPT-4o and LLaVA-NeXT, for zero-shot classification of low-surface brightness galaxies and artifacts.
We show that, with natural-language prompts, these models achieve significant accuracy (typically above 80%) without additional training or fine-tuning.
arXiv Detail & Related papers (2024-06-24T18:17:54Z)
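Zero-shot classification of this kind reduces to a single prompted API call. Below is a minimal sketch assuming the OpenAI Python client; the file name, prompt wording, and labels are illustrative, not the paper's exact setup.

```python
# Zero-shot image classification with a natural-language prompt
# (a sketch assuming the OpenAI Python client; prompt and labels are
# illustrative, not the exact ones used in the paper).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cutout.png", "rb") as f:  # hypothetical galaxy cutout image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Classify this image as 'low-surface-brightness galaxy' "
                     "or 'artifact'. Answer with the label only."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```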
- Aurora: A Foundation Model of the Atmosphere [56.97266186291677]
We introduce Aurora, a large-scale foundation model of the atmosphere trained on over a million hours of diverse weather and climate data.
In under a minute, Aurora produces 5-day global air pollution predictions and 10-day high-resolution weather forecasts.
arXiv Detail & Related papers (2024-05-20T14:45:18Z)
- Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification [7.592813175419603]
We present a comprehensive evaluation of deep-learning and large language model (LLM) based models for the automatic classification of variable star light curves.
Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision.
We unveil StarWhisper LightCurve (LC), an innovative series comprising three LLM-based models: an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM).
arXiv Detail & Related papers (2024-04-16T17:35:25Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, every other model in our experiments scores between 9% and 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows [0.0]
We present novel machine learning models developed by the AstroAI team for the Ariel Data Challenge 2023.
One of the models secured the top position among 293 competitors.
We introduce an alternative model that exhibits higher performance potential than the winning model, despite scoring lower in the challenge.
arXiv Detail & Related papers (2023-09-17T17:59:59Z)
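Normalizing flows of the kind named in the title model a conditional posterior by stacking invertible transforms. Here is a minimal PyTorch sketch of one conditional RealNVP-style coupling layer, conditioned on a spectrum embedding; the dimensions and design are illustrative assumptions, not the team's winning architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: transforms half the dimensions
    conditioned on the other half and on an observation embedding."""
    def __init__(self, dim, context_dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x, context):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, context], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                  # keep the scales numerically stable
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)            # log|det Jacobian| of the transform
        return torch.cat([x1, y2], dim=-1), log_det

# Toy use: density of 6 atmospheric parameters given a 32-d spectrum embedding
layer = AffineCoupling(dim=6, context_dim=32)
params = torch.randn(16, 6)
spectrum = torch.randn(16, 32)
z, log_det = layer(params, spectrum)
base = torch.distributions.Normal(0.0, 1.0)
log_prob = base.log_prob(z).sum(-1) + log_det   # change-of-variables formula
```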
- Supernova Light Curves Approximation based on Neural Network Models [53.180678723280145]
Data-driven photometric classification of supernovae has become challenging with the advent of real-time processing of big data in astronomy.
Recent studies have demonstrated the superior quality of solutions based on various machine learning models.
We study the application of multilayer perceptrons (MLP), Bayesian neural networks (BNN), and normalizing flows (NF) to approximate observations for a single light curve.
arXiv Detail & Related papers (2022-06-27T13:46:51Z)
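Of the three approximators studied in that paper, the MLP variant is the simplest: fit a small network mapping time to flux for one light curve, then evaluate it on a dense time grid. Below is a minimal PyTorch sketch on a toy Gaussian-pulse "light curve"; the data, architecture, and training setup are illustrative, not those of the paper.

```python
import torch
import torch.nn as nn

# Fit a small MLP to a single light curve: flux as a function of time.
# The BNN and normalizing-flow variants studied in the paper are not shown.
t = torch.linspace(0.0, 1.0, 100).unsqueeze(1)   # normalized observation times
flux = torch.exp(-((t - 0.3) / 0.1) ** 2) + 0.02 * torch.randn_like(t)  # toy pulse

mlp = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(t), flux)
    loss.backward()
    opt.step()

dense_t = torch.linspace(0.0, 1.0, 1000).unsqueeze(1)   # dense interpolation grid
with torch.no_grad():
    approx = mlp(dense_t)                               # smoothed light curve
```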