Related papers: AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy

URL: http://arxiv.org/abs/2505.20538v3
Date: Tue, 03 Jun 2025 18:56:38 GMT
Title: AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy
Authors: Sebastian Antony Joseph, Syed Murtaza Husain, Stella S. R. Offner, Stéphanie Juneau, Paul Torrey, Adam S. Bolton, Juan P. Farias, Niall Gaffney, Greg Durrett, Junyi Jessy Li
Abstract summary: We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. We present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants.
Score: 59.32718342798908
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are being explored for applications in scientific research, including their capabilities to synthesize literature, answer research questions, generate research ideas, and even conduct computational experiments. Ultimately, our goal is for these to help scientists derive novel scientific insights. In many areas of science, such insights often arise from processing and visualizing data to understand its patterns. However, evaluating whether an LLM-mediated scientific workflow produces outputs conveying the correct scientific insights is challenging and has not been addressed in past work. We introduce AstroVisBench, the first benchmark for both scientific computing and visualization in the astronomy domain. AstroVisBench judges a language model's ability to both (1) create astronomy-specific workflows to process and analyze data and (2) visualize the results of these workflows through complex plots. Our evaluation of visualizations uses a novel LLM-as-a-judge workflow, which is validated against annotation by five professional astronomers. Using AstroVisBench, we present an evaluation of state-of-the-art language models, showing a significant gap in their ability to engage in astronomy research as useful assistants. This evaluation provides a strong end-to-end evaluation for AI scientists that offers a path forward for the development of visualization-based workflows, which are central to a broad range of domains from physics to biology.

Related papers

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z)
pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy [2.6952253149772996]
Pathfinder is a machine learning framework designed to enable literature review and knowledge discovery in astronomy. Our framework couples advanced retrieval techniques with LLM-based synthesis to search astronomical literature by semantic context. It addresses complexities of jargon, named entities, and temporal aspects through time-based and citation-based weighting schemes.
arXiv Detail & Related papers (2024-08-02T20:05:24Z)
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled. We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks. SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning. We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs. To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
Assessing Exoplanet Habitability through Data-driven Approaches: A Comprehensive Literature Review [0.0]
This review aims to illuminate the emerging trends and advancements within exoplanet research. It focuses on the interplay between exoplanet detection, classification, and visualization, and describes the broad spectrum of machine learning approaches employed in exoplanet research.
arXiv Detail & Related papers (2023-05-18T17:18:15Z)
Workflow Provenance in the Lifecycle of Scientific Machine Learning [1.6118907823528272]
We leverage workflow techniques to build a holistic view to support the lifecycle of scientific ML. We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs.
arXiv Detail & Related papers (2020-09-30T13:09:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.