AstroMLab 1: Who Wins Astronomy Jeopardy!?
- URL: http://arxiv.org/abs/2407.11194v2
- Date: Fri, 08 Nov 2024 22:00:26 GMT
- Title: AstroMLab 1: Who Wins Astronomy Jeopardy!?
- Authors: Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi,
- Abstract summary: This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics.
Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy.
Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models.
- Score: 4.162245706139047
- License:
- Abstract: We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a consistent reduction, every 3 to 12 months, in the cost required to achieve a similar score on this astronomy benchmark. Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more on questions related to exoplanets, stellar astrophysics, and instrumentation. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training-data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
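As a minimal illustration of the evaluation protocol described in the abstract above (multiple-choice scoring plus a confidence-correctness calibration check), a small sketch is given below. This is not the authors' code: the record format, the field names, and the model_answer() stub are assumptions, and the calibration measure here is simply a Pearson correlation between stated confidence and per-question correctness (the paper reports values above 0.9 for top models; its exact aggregation may differ).

```python
# Hypothetical sketch of MCQ benchmark scoring with a calibration check.
# model_answer() is a stand-in for a real LLM call and is purely illustrative.
import random
from statistics import correlation  # Pearson's r (Python 3.10+)

def model_answer(question, choices):
    """Hypothetical stub: return (predicted choice index, stated confidence)."""
    return random.randrange(len(choices)), random.uniform(0.25, 1.0)

def evaluate(benchmark):
    """benchmark: list of dicts with 'question', 'choices', 'answer' keys (assumed schema)."""
    correct_flags, confidences = [], []
    for item in benchmark:
        pred, conf = model_answer(item["question"], item["choices"])
        correct_flags.append(1.0 if pred == item["answer"] else 0.0)
        confidences.append(conf)
    accuracy = sum(correct_flags) / len(correct_flags)
    # Simple per-question confidence-correctness correlation as a calibration proxy.
    calib = correlation(confidences, correct_flags)
    return accuracy, calib

if __name__ == "__main__":
    toy = [{"question": "Q?", "choices": ["A", "B", "C", "D"], "answer": 0}
           for _ in range(100)]
    acc, calib = evaluate(toy)
    print(f"accuracy={acc:.3f}  confidence-correctness correlation={calib:.3f}")
```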
Related papers
- Astromer 2 [1.236974227340167]
We introduce Astromer 2 as an enhanced iteration of our self-supervised model for light curve analysis.
Astromer 2 is pretrained on 1.5 million single-band light curves from the MACHO survey using a self-supervised learning task.
Our results demonstrate that Astromer 2 significantly outperforms Astromer 1 across all evaluated scenarios.
arXiv Detail & Related papers (2025-02-04T20:56:14Z)
- ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study [26.39743358097732]
ORBIT is a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources.
Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76%.
Our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions.
arXiv Detail & Related papers (2024-12-19T01:35:47Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level.
Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation.
Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving only 60.54% and 52.55% accuracy, respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy [4.729846733874557]
This study aims to quantitatively assess specialized LLMs in astronomy.
We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model.
Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements.
arXiv Detail & Related papers (2024-09-29T16:02:22Z)
- Real-time gravitational-wave inference for binary neutron stars using machine learning [71.29593576787549]
We present a machine learning framework that performs complete BNS inference in just one second without making any approximations.
Our approach enhances multi-messenger observations by providing (i) accurate localization even before the merger; (ii) improved localization precision by $\sim$30% compared to approximate low-latency methods; and (iii) detailed information on luminosity distance, inclination, and masses.
arXiv Detail & Related papers (2024-07-12T18:00:02Z)
- Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification [7.592813175419603]
We present a comprehensive evaluation of deep-learning and large language model (LLM) based models for the automatic classification of variable star light curves.
Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision.
We unveil StarWhisper LightCurve (LC), an innovative series comprising three LLM-based models: a large language model (LLM), a multimodal large language model (MLLM), and a large audio language model (LALM).
arXiv Detail & Related papers (2024-04-16T17:35:25Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation [48.66623377464203]
Our novel approach introduces the Dynamic One-For-All (DOFA) model, leveraging the concept of neural plasticity in brain science.
This dynamic hypernetwork, adjusting to different wavelengths, enables a single versatile Transformer jointly trained on data from five sensors to excel across 12 distinct Earth observation tasks.
arXiv Detail & Related papers (2024-03-22T17:11:47Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows [0.0]
We present novel machine learning models developed by the AstroAI team for the Ariel Data Challenge 2023.
One of the models secured the top position among 293 competitors.
We introduce an alternative model that exhibits higher performance potential than the winning model, despite scoring lower in the challenge.
arXiv Detail & Related papers (2023-09-17T17:59:59Z)
- Supernova Light Curves Approximation based on Neural Network Models [53.180678723280145]
Photometric data-driven classification of supernovae has become challenging with the advent of real-time processing of big data in astronomy.
Recent studies have demonstrated the superior quality of solutions based on various machine learning models.
We study the application of multilayer perceptrons (MLP), Bayesian neural networks (BNN), and normalizing flows (NF) to approximate observations of a single light curve; a minimal illustration follows this entry.
arXiv Detail & Related papers (2022-06-27T13:46:51Z)
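Below is a minimal sketch of the MLP-based light-curve approximation mentioned in the entry above. It is not the paper's implementation: the synthetic, irregularly sampled "light curve", the network size, and all other hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: approximate a single irregularly sampled light curve
# with a small MLP regressor (time -> flux), in the spirit of an MLP baseline.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic observations: irregular times, a smooth periodic flux signal, photometric noise.
t_obs = np.sort(rng.uniform(0.0, 10.0, size=80))
flux_obs = 1.0 + 0.3 * np.sin(2.0 * np.pi * t_obs / 3.0) + rng.normal(0.0, 0.02, t_obs.size)

# Fit an MLP that maps time to flux for this one object.
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0)
mlp.fit(t_obs.reshape(-1, 1), flux_obs)

# Evaluate the approximation on a dense time grid (e.g. to resample the curve
# for a downstream classifier).
t_grid = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
flux_pred = mlp.predict(t_grid)
print("max |residual| on observed points:",
      np.abs(mlp.predict(t_obs.reshape(-1, 1)) - flux_obs).max())
```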
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.