AstroMLab 1: Who Wins Astronomy Jeopardy!?
- URL: http://arxiv.org/abs/2407.11194v1
- Date: Mon, 15 Jul 2024 19:28:14 GMT
- Title: AstroMLab 1: Who Wins Astronomy Jeopardy!?
- Authors: Yuan-Sen Ting, Tuan Dung Nguyen, Tirthankar Ghosal, Rui Pan, Hardik Arora, Zechang Sun, Tijmen de Haan, Nesar Ramachandra, Azton Wells, Sandeep Madireddy, Alberto Accomazzi, et al.
- Abstract summary: This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics.
Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy.
Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models.
- Score: 4.162245706139047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a comprehensive evaluation of proprietary and open-weights large language models using the first astronomy-specific benchmarking dataset. This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics, covering a broad range of astrophysical topics. Our analysis examines model performance across various astronomical subfields and assesses response calibration, crucial for potential deployment in research environments. Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy. For proprietary models, we observed a universal reduction in cost every 3 to 12 months to achieve a similar score on this particular astronomy benchmark. Open-source models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models. We identify performance variations across topics, with non-English-focused models generally struggling more on exoplanet-related fields, stellar astrophysics, and instrumentation-related questions. These challenges likely stem from less abundant training data, limited historical context, and rapid recent developments in these areas. This pattern is observed across both open-weights and proprietary models, with regional dependencies evident, highlighting the impact of training data diversity on model performance in specialized scientific domains. Top-performing models demonstrate well-calibrated confidence, with correlations above 0.9 between confidence and correctness, though they tend to be slightly underconfident. The development of fast, low-cost inference for open-weights models presents new opportunities for affordable deployment in astronomy. The rapid progress observed suggests that LLM-driven research in astronomy may become feasible in the near future.
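The abstract's calibration claim (correlations above 0.9 between confidence and correctness) can be illustrated with a short sketch. The records below are synthetic stand-ins, not the benchmark's actual results; the binning-then-correlating procedure is one common way such a number is computed, not necessarily the paper's exact method.

```python
# Sketch: correlating a model's stated confidence with its empirical
# accuracy on a multiple-choice benchmark, using SIMULATED records.
import random

random.seed(0)

# Each record is (confidence in [0, 1], answered correctly?).
# For a well-calibrated model, correctness tracks confidence.
records = []
for _ in range(4425):  # same size as the benchmark; content is synthetic
    conf = random.uniform(0.25, 1.0)
    correct = random.random() < conf
    records.append((conf, correct))

# Group records into equal-width confidence bins.
bins = {}
for conf, correct in records:
    b = min(int(conf * 10), 9)
    bins.setdefault(b, []).append((conf, correct))

# Per bin: mean stated confidence vs. fraction answered correctly.
xs, ys = [], []
for b in sorted(bins):
    pairs = bins[b]
    xs.append(sum(c for c, _ in pairs) / len(pairs))   # mean confidence
    ys.append(sum(ok for _, ok in pairs) / len(pairs)) # empirical accuracy

# Pearson correlation between binned confidence and accuracy.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sx = sum((x - mx) ** 2 for x in xs) ** 0.5
sy = sum((y - my) ** 2 for y in ys) ** 0.5
r = cov / (sx * sy)
```

A model is underconfident, in the paper's sense, when `ys` sits systematically above `xs` even while `r` stays high; the correlation measures monotone agreement, not the offset.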
Related papers
- Real-time gravitational-wave inference for binary neutron stars using machine learning [71.29593576787549]
We develop a machine learning approach that performs complete BNS inference in just one second without making any approximations.
Our method scales to extremely long signals, up to an hour in length, thus serving as a blueprint for data analysis for next-generation ground- and space-based detectors.
arXiv Detail & Related papers (2024-07-12T18:00:02Z)
- At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models [0.0]
Vision-language multimodal models (VLMs) offer the possibility of zero-shot classification in astronomy.
We investigate two models, GPT-4o and LLaVA-NeXT, for zero-shot classification of low-surface brightness galaxies and artifacts.
We show that with natural language prompts, these models achieve significant accuracy (typically above 80 percent) without additional training or fine-tuning.
arXiv Detail & Related papers (2024-06-24T18:17:54Z)
- Detecting and Classifying Flares in High-Resolution Solar Spectra with Supervised Machine Learning [0.0]
We present a standardized procedure to classify solar flares with the aid of supervised machine learning.
Using flare data from the RHESSI mission and solar spectra from the HARPS-N instrument, we trained several supervised machine learning models.
The best-trained model achieves an average aggregate accuracy score of 0.65, and categorical accuracy scores of over 0.70 for the no-flare and weak-flare classes.
arXiv Detail & Related papers (2024-06-21T18:52:03Z)
- Aurora: A Foundation Model of the Atmosphere [56.97266186291677]
We introduce Aurora, a large-scale foundation model of the atmosphere trained on over a million hours of diverse weather and climate data.
In under a minute, Aurora produces 5-day global air pollution predictions and 10-day high-resolution weather forecasts.
arXiv Detail & Related papers (2024-05-20T14:45:18Z)
- Deep Learning and LLM-based Methods Applied to Stellar Lightcurve Classification [7.592813175419603]
We present a comprehensive evaluation of deep-learning and large language model (LLM) based models for the automatic classification of variable star light curves.
Special emphasis is placed on Cepheids, RR Lyrae, and eclipsing binaries, examining the influence of observational cadence and phase distribution on classification precision.
We unveil StarWhisper LightCurve (LC), an innovative series comprising three LLM-based models: an LLM, a multimodal large language model (MLLM), and a large audio language model (LALM).
arXiv Detail & Related papers (2024-04-16T17:35:25Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- Simulation-based Inference for Exoplanet Atmospheric Retrieval: Insights from winning the Ariel Data Challenge 2023 using Normalizing Flows [0.0]
We present novel machine learning models developed by the AstroAI team for the Ariel Data Challenge 2023.
One of the models secured the top position among 293 competitors.
We introduce an alternative model that exhibits higher performance potential than the winning model, despite scoring lower in the challenge.
arXiv Detail & Related papers (2023-09-17T17:59:59Z)
- Supernova Light Curves Approximation based on Neural Network Models [53.180678723280145]
Photometric data-driven classification of supernovae has become a challenge with the advent of real-time processing of big data in astronomy.
Recent studies have demonstrated the superior quality of solutions based on various machine learning models.
We study the application of multilayer perceptrons (MLP), Bayesian neural networks (BNN), and normalizing flows (NF) to approximate observations for a single light curve.
arXiv Detail & Related papers (2022-06-27T13:46:51Z)
- Improving Astronomical Time-series Classification via Data Augmentation with Generative Adversarial Networks [1.2891210250935146]
We propose a data augmentation methodology based on Generative Adversarial Networks (GANs) to generate a variety of synthetic light curves from variable stars.
The classification accuracy of variable stars is improved significantly when training with synthetic data and testing with real data.
arXiv Detail & Related papers (2022-05-13T16:39:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.