AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
- URL: http://arxiv.org/abs/2502.01159v1
- Date: Mon, 03 Feb 2025 08:50:46 GMT
- Title: AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
- Authors: Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan,
- Abstract summary: We present a novel benchmark designed to assess large language models (LLMs) across five core categories of atmospheric science problems.
We employ a template-based question generation framework, enabling scalable and diverse multiple-choice questions curated from graduate-level atmospheric science problems.
Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science.
- Score: 2.7804525903465964
- License:
- Abstract: The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. To address this need, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. We employ a template-based question generation framework, enabling scalable and diverse multiple-choice questions curated from graduate-level atmospheric science problems. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate service by offering a standard and rigorous evaluation framework. Our source codes are currently available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
Related papers
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Physics-Guided Foundation Model for Scientific Discovery: An Application to Aquatic Science [13.28811382673697]
We propose a textittextbfPhysics-textbfGuided textbfFoundation textbfModel (textbfPGFM) that combines pre-trained ML models and physics-based models.
We demonstrate the effectiveness of this methodology in modeling water temperature and dissolved oxygen dynamics in real-world lakes.
arXiv Detail & Related papers (2025-02-10T00:48:10Z) - Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning [51.11965014462375]
Multimodal Large Language Models (MLLMs) integrate text, images, and other modalities.
This paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology.
arXiv Detail & Related papers (2025-02-05T04:05:27Z) - ClimaQA: An Automated Evaluation Framework for Climate Foundation Models [38.05357439484919]
We develop ClimaGen, an automated framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop.
We present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science.
arXiv Detail & Related papers (2024-10-22T05:12:19Z) - ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models [62.37850540570268]
Existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals.
ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks.
Results show that while general LLMs excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge.
arXiv Detail & Related papers (2024-09-21T02:50:43Z) - Recent Advances on Machine Learning for Computational Fluid Dynamics: A Survey [51.87875066383221]
This paper introduces fundamental concepts, traditional methods, and benchmark datasets, then examine the various roles Machine Learning plays in improving CFD.
We highlight real-world applications of ML for CFD in critical scientific and engineering disciplines, including aerodynamics, combustion, atmosphere & ocean science, biology fluid, plasma, symbolic regression, and reduced order modeling.
We draw the conclusion that ML is poised to significantly transform CFD research by enhancing simulation accuracy, reducing computational time, and enabling more complex analyses of fluid dynamics.
arXiv Detail & Related papers (2024-08-22T07:33:11Z) - On the Opportunities of (Re)-Exploring Atmospheric Science by Foundation Models: A Case Study [2.672038860046272]
Most state-of-the-art AI applications in atmospheric science are based on classic deep learning approaches.
This report explores how the state-of-the-art foundation model, i.e., GPT-4o, performs various atmospheric scientific tasks.
arXiv Detail & Related papers (2024-07-25T07:57:34Z) - Assessing Large Language Models on Climate Information [5.034118180129635]
We present a comprehensive evaluation framework grounded in science communication research to assess Large Language Models (LLMs)
Our framework emphasizes both presentational responses and adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues.
We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education.
arXiv Detail & Related papers (2023-10-04T16:09:48Z) - SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce an expansive benchmark suite SciBench for Large Language Model (LLM)
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z) - ClimaX: A foundation model for weather and climate [51.208269971019504]
ClimaX is a deep learning model for weather and climate science.
It can be pre-trained with a self-supervised learning objective on climate datasets.
It can be fine-tuned to address a breadth of climate and weather tasks.
arXiv Detail & Related papers (2023-01-24T23:19:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.