EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
- URL: http://arxiv.org/abs/2505.17139v3
- Date: Fri, 30 May 2025 07:31:07 GMT
- Title: EarthSE: A Benchmark for Evaluating Earth Scientific Exploration Capability of LLMs
- Authors: Wanghan Xu, Xiangyu Zhao, Yuhao Zhou, Xiaoyu Yue, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai
- Abstract summary: We present a professional benchmark for the Earth sciences, designed to evaluate the capabilities of Large Language Models (LLMs) in scientific exploration. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron and Earth-Silver. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing knowledge crucial for scientific exploration.
- Score: 36.72915099998998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in Large Language Models (LLMs) drive interest in scientific applications, necessitating specialized benchmarks for fields such as Earth science. Existing benchmarks either present a general science focus devoid of Earth science specificity or cover isolated subdomains, lacking holistic evaluation. Furthermore, current benchmarks typically neglect the assessment of LLMs' capabilities in open-ended scientific exploration. In this paper, we present a comprehensive and professional benchmark for the Earth sciences, designed to evaluate the capabilities of LLMs in scientific exploration within this domain, spanning from fundamental to advanced levels. Leveraging a corpus of 100,000 research papers, we first construct two Question Answering (QA) datasets: Earth-Iron, which offers extensive question coverage for broad assessment, and Earth-Silver, which features a higher level of difficulty to evaluate professional depth. These datasets encompass five Earth spheres, 114 disciplines, and 11 task categories, assessing foundational knowledge crucial for scientific exploration. Most notably, we introduce Earth-Gold, a dataset of open-ended multi-turn dialogues with new metrics, specifically designed to evaluate the advanced capabilities of LLMs in scientific exploration, including methodology induction, limitation analysis, and concept proposal. Extensive experiments reveal limitations in 11 leading LLMs across different domains and tasks, highlighting considerable room for improvement in their scientific exploration capabilities. The benchmark is available at https://huggingface.co/ai-earth.
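The abstract states that the benchmark is hosted at https://huggingface.co/ai-earth. Below is a minimal sketch of how one might load a QA split with the Hugging Face `datasets` library and score a model on it; the repository id (`ai-earth/Earth-Iron`), the `test` split name, the `question`/`options`/`answer` field names, and the `ask_llm` client are all assumptions for illustration, not confirmed by the paper.

```python
# Minimal sketch: load an EarthSE QA split and compute multiple-choice accuracy.
# ASSUMPTIONS (not confirmed by the abstract): repo id "ai-earth/Earth-Iron",
# a "test" split, and per-row "question", "options", "answer" fields with
# single-letter answers. ask_llm() is a hypothetical stand-in for any client.
from datasets import load_dataset


def ask_llm(prompt: str) -> str:
    """Replace with a call to your preferred chat/completions API."""
    raise NotImplementedError


def evaluate(repo_id: str = "ai-earth/Earth-Iron", limit: int = 100) -> float:
    ds = load_dataset(repo_id, split="test")
    n = min(limit, len(ds))
    correct = 0
    for row in ds.select(range(n)):
        prompt = (
            f"Question: {row['question']}\n"
            f"Options: {row['options']}\n"
            "Reply with the letter of the correct option only."
        )
        pred = ask_llm(prompt).strip().upper()[:1]  # first letter of the reply
        correct += pred == str(row["answer"]).strip().upper()[:1]
    return correct / n
```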
Related papers
- Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning [59.518397361341556]
We present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of Multimodal Large Language Models (MLLMs). SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms.
arXiv Detail & Related papers (2025-06-12T09:29:16Z)
- MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science [39.224598950224454]
We introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges.
arXiv Detail & Related papers (2025-05-27T05:30:35Z)
- Bridging AI and Carbon Capture: A Dataset for LLMs in Ionic Liquids and CBE Research [3.2995359570845912]
Large Language Models (LLMs) have demonstrated exceptional performance in general knowledge and reasoning tasks. Their effectiveness in specialized scientific fields like Chemical and Biological Engineering (CBE) remains underexplored. We release an expert-curated dataset of 5,920 examples designed to benchmark LLMs' reasoning in CBE.
arXiv Detail & Related papers (2025-05-11T12:32:57Z)
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models [36.724471610075696]
We propose SciHorizon, a comprehensive assessment framework designed to benchmark the readiness of AI4Science. First, we introduce a generalizable framework for assessing AI-ready scientific data, encompassing four key dimensions: Quality, FAIRness, Explainability, and Compliance. We present recommendation lists of AI-ready datasets for Earth, Life, and Materials Sciences, making a novel and original contribution to the field.
arXiv Detail & Related papers (2025-03-12T11:34:41Z)
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning [51.11965014462375]
Multimodal Large Language Models (MLLMs) integrate text, images, and other modalities. This paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology.
arXiv Detail & Related papers (2025-02-05T04:05:27Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models [35.98892300665275]
We introduce the SciKnowEval benchmark, a framework that evaluates large language models (LLMs) across five progressive levels of scientific knowledge.
These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including memory, comprehension, reasoning, discernment, and application.
We benchmark 26 advanced open-source and proprietary LLMs using zero-shot and few-shot prompting strategies; a minimal sketch of both prompt formats appears after this list.
arXiv Detail & Related papers (2024-06-13T13:27:52Z)
- Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z)
- SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research [11.816426823341134]
We propose SciEval, a comprehensive and multi-disciplinary evaluation benchmark to address these issues.
Based on Bloom's taxonomy, SciEval covers four dimensions to systematically evaluate scientific research ability.
Both objective and subjective questions are included in SciEval.
arXiv Detail & Related papers (2023-08-25T03:05:33Z)
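Several of the benchmarks above, SciKnowEval among them, evaluate models under zero-shot and few-shot prompting. The sketch below illustrates the two prompt formats in their common form; the instruction wording and the example items are hypothetical, not drawn from any of these papers.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction, as commonly
# used when benchmarking LLMs. The wording and exemplars are hypothetical.
def zero_shot_prompt(question: str) -> str:
    # Zero-shot: the model sees only the task instruction and the question.
    return f"Answer the following science question concisely.\nQ: {question}\nA:"


def few_shot_prompt(question: str, exemplars: list[tuple[str, str]]) -> str:
    # Few-shot: k worked (question, answer) pairs precede the target question.
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return (
        "Answer the following science question concisely.\n"
        f"{demos}\nQ: {question}\nA:"
    )


if __name__ == "__main__":
    demos = [("Which gas dominates Earth's atmosphere?", "Nitrogen")]
    print(zero_shot_prompt("What process drives plate tectonics?"))
    print(few_shot_prompt("What process drives plate tectonics?", demos))
```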