AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy
- URL: http://arxiv.org/abs/2510.00063v2
- Date: Tue, 21 Oct 2025 17:29:47 GMT
- Title: AstroMMBench: A Benchmark for Evaluating Multimodal Large Language Models Capabilities in Astronomy
- Authors: Jinghang Shi, Xiaoyu Tang, Yang Huang, Yuyang Li, Xiao Kong, Yanxia Zhang, Caizhan Yue,
- Abstract summary: We introduce AstroMMBench, the first comprehensive benchmark to evaluate multimodal large language models (MLLMs) in astronomical image understanding. AstroMMBench comprises 621 multiple-choice questions across six astrophysical subfields, curated and reviewed by 15 domain experts for quality and relevance. Results show that Ovis2-34B achieved the highest overall accuracy (70.5%), demonstrating leading capabilities even compared to strong closed-source models.
- Score: 6.247581175023764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Astronomical image interpretation presents a significant challenge for applying multimodal large language models (MLLMs) to specialized scientific tasks. Existing benchmarks focus on general multimodal capabilities but fail to capture the complexity of astronomical data. To bridge this gap, we introduce AstroMMBench, the first comprehensive benchmark designed to evaluate MLLMs in astronomical image understanding. AstroMMBench comprises 621 multiple-choice questions across six astrophysical subfields, curated and reviewed by 15 domain experts for quality and relevance. We conducted an extensive evaluation of 25 diverse MLLMs, including 22 open-source and 3 closed-source models, using AstroMMBench. The results show that Ovis2-34B achieved the highest overall accuracy (70.5%), demonstrating leading capabilities even compared to strong closed-source models. Performance varied across the six astrophysical subfields: domains such as cosmology and high-energy astrophysics proved particularly challenging, while models performed relatively better in others, such as instrumentation and solar astrophysics. These findings underscore the vital role of domain-specific benchmarks like AstroMMBench in critically evaluating MLLM performance and guiding their targeted development for scientific applications. AstroMMBench provides a foundational resource and a dynamic tool to catalyze advancements at the intersection of AI and astronomy.
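In practice, scoring an MLLM on a multiple-choice benchmark like this reduces to parsing each free-form reply into a choice letter and tallying overall and per-subfield accuracy. The sketch below is illustrative only: the question schema (`subfield` and `answer` fields) and the letter-extraction heuristic are assumptions, not AstroMMBench's published data format or evaluation harness.

```python
import re
from collections import defaultdict

def grade(questions, replies):
    """Score free-form model replies against multiple-choice answer keys.

    questions: list of dicts with (assumed) keys "subfield" and "answer"
    (a letter A-D); replies: the model's raw text outputs, in order.
    Returns overall accuracy and a per-subfield breakdown.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for q, reply in zip(questions, replies):
        # Heuristic: take the first standalone A-D in the reply as the choice.
        m = re.search(r"\b([A-D])\b", reply)
        choice = m.group(1) if m else None
        total[q["subfield"]] += 1
        correct[q["subfield"]] += choice == q["answer"]
    overall = sum(correct.values()) / sum(total.values())
    return overall, {s: correct[s] / total[s] for s in total}

# Example: one cosmology question, one instrumentation question.
qs = [{"subfield": "cosmology", "answer": "B"},
      {"subfield": "instrumentation", "answer": "D"}]
print(grade(qs, ["The answer is B.", "I would pick C here."]))
# -> (0.5, {'cosmology': 1.0, 'instrumentation': 0.0})
```

Real harnesses typically add stricter answer-format prompting and a fallback for replies with no extractable letter, since parsing failures otherwise count as errors.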
Related papers
- HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z) - Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond [116.65158801881984]
We introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. We develop a unified and interpretable FER foundation model termed UniFER-7B.
arXiv Detail & Related papers (2025-11-01T03:53:00Z) - BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities [61.173773299032746]
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, spanning tasks from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
arXiv Detail & Related papers (2025-10-09T19:18:36Z) - AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy [59.32718342798908]
We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. We present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.
arXiv Detail & Related papers (2025-05-26T21:49:18Z) - AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model [3.911100968725141]
General-purpose large language models often struggle with specialized domain knowledge. This study introduces AstroSage-70B, a significantly larger and more advanced domain-specialized natural-language AI assistant. It is designed for research and education across astronomy, astrophysics, space science, astroparticle physics, cosmology, and astronomical instrumentation.
arXiv Detail & Related papers (2025-05-23T07:58:50Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [118.8024915014751]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z) - FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [56.08867996209236]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources. We introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios. We develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies. (A minimal sketch of federated averaging, the canonical classic FL method, appears after this list.)
arXiv Detail & Related papers (2024-11-22T04:09:23Z) - AstroMLab 2: AstroLLaMA-2-70B Model and Benchmarking Specialised LLMs for Astronomy [4.729846733874557]
This study aims to quantitatively assess specialized LLMs in astronomy.
We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model.
Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield significant improvements.
arXiv Detail & Related papers (2024-09-29T16:02:22Z) - AstroMLab 1: Who Wins Astronomy Jeopardy!? [4.162245706139047]
This dataset comprises 4,425 multiple-choice questions curated from the Annual Review of Astronomy and Astrophysics.
Claude-3.5-Sonnet outperforms competitors by up to 4.6 percentage points, achieving 85.0% accuracy.
Open-weights models have rapidly improved, with LLaMA-3-70b (80.6%) and Qwen-2-72b (77.7%) now competing with some of the best proprietary models.
arXiv Detail & Related papers (2024-07-15T19:28:14Z) - AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets [7.53209156977206]
We explore the potential of enhancing LLM performance in astronomy-focused question-answering through targeted, continual pre-training.
We achieve notable improvements in specialized topic comprehension using a curated set of astronomy corpora.
We present an extension of AstroLLaMA: the fine-tuning of the 7B LLaMA model on a domain-specific conversational dataset, culminating in the release of the chat-enabled AstroLLaMA for community use.
arXiv Detail & Related papers (2024-01-03T04:47:02Z) - AstroLLaMA: Towards Specialized Foundation Models in Astronomy [1.1694367694169385]
We introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv.
Our model produces more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models.
Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
arXiv Detail & Related papers (2023-09-12T11:02:27Z)
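As noted in the FedMLLM entry above, the classic FL method underlying federated fine-tuning is federated averaging (FedAvg): each client fine-tunes locally on its private data, and a server averages the resulting weights, weighted by local dataset size. The sketch below illustrates only that aggregation step, under assumed toy inputs; it is not FedMLLM's actual framework or API.

```python
import torch

def fedavg(client_states, client_sizes):
    """One FedAvg aggregation step: average client state_dicts,
    weighted by each client's local dataset size."""
    total = sum(client_sizes)
    return {
        k: sum(state[k].float() * (n / total)
               for state, n in zip(client_states, client_sizes))
        for k in client_states[0].keys()
    }

# Example with toy two-parameter "models" from two clients.
a = {"w": torch.tensor([1.0, 1.0]), "b": torch.tensor([0.0])}
b = {"w": torch.tensor([3.0, 3.0]), "b": torch.tensor([2.0])}
avg = fedavg([a, b], client_sizes=[100, 300])  # client b has 3x the data
print(avg["w"])  # tensor([2.5000, 2.5000]); b's weights dominate 3:1
```

In a full training loop, the server broadcasts the averaged weights back to clients and the fine-tune/aggregate cycle repeats for several rounds.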