ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models
- URL: http://arxiv.org/abs/2410.16701v2
- Date: Sun, 09 Mar 2025 18:31:12 GMT
- Title: ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models
- Authors: Veeramakali Vignesh Manivannan, Yasaman Jafari, Srikar Eranky, Spencer Ho, Rose Yu, Duncan Watson-Parris, Yian Ma, Leon Bergen, Taylor Berg-Kirkpatrick
- Abstract summary: We develop ClimaGen, an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. We present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science.
- Score: 38.05357439484919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of Large Language Models (LLMs) in climate science has recently gained significant attention. However, a critical issue remains: the lack of a comprehensive evaluation framework capable of assessing the quality and scientific validity of model outputs. To address this issue, we develop ClimaGen (Climate QA Generator), an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. As a result, we present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science. Finally, we develop evaluation strategies and compare different LLMs on our benchmarks. Our results offer novel insights into various approaches used to enhance knowledge of climate LLMs. The source code is publicly available at https://github.com/Rose-STL-Lab/genie-climaqa
Related papers
- ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method [61.76389719956301]
We contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns time series climate data from ERA5, extreme weather events data from NOAA, and satellite image data from NASA.
Under each data modality, we also propose a simple but strong generative method that could produce competitive performance in weather forecasting, thunderstorm alerts, and crop segmentation tasks.
arXiv Detail & Related papers (2025-04-10T02:22:23Z)
- CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ) [14.065907685322097]
CliME is a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts.
The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions.
We present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity.
Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis.
arXiv Detail & Related papers (2025-04-04T20:01:00Z)
- ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering [54.80411755871931]
Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth.
Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format.
This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful.
We introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data.
arXiv Detail & Related papers (2024-07-24T01:46:55Z)
- Advancing Data-driven Weather Forecasting: Time-Sliding Data Augmentation of ERA5 [3.3748750222488657]
We introduce a novel strategy that deviates from the common dependence on high-resolution data.
This paper improves on conventional approaches by adding more variables and a novel approach to data augmentation and processing.
Our findings reveal that despite the lower resolution, the proposed approach demonstrates considerable accuracy in predicting atmospheric conditions.
arXiv Detail & Related papers (2024-02-13T03:01:22Z)
- FengWu-GHR: Learning the Kilometer-scale Medium-range Global Weather Forecasting [56.73502043159699]
This work presents FengWu-GHR, the first data-driven global weather forecasting model running at the 0.09$^{\circ}$ horizontal resolution.
It introduces a novel approach that opens the door for operating ML-based high-resolution forecasts by inheriting prior knowledge from a low-resolution model.
The hindcast of weather prediction in 2022 indicates that FengWu-GHR is superior to the IFS-HRES.
arXiv Detail & Related papers (2024-01-28T13:23:25Z)
- ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change [21.827936253363603]
This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change.
We trained two 7B models from scratch on a science-oriented dataset of 300B tokens.
ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama2 on a domain-specific dataset of 4.2B tokens.
arXiv Detail & Related papers (2024-01-17T23:29:46Z)
- Climate Change from Large Language Models [7.190384101545232]
Climate change poses grave challenges, demanding widespread understanding and low-carbon lifestyle awareness.
Large language models (LLMs) offer a powerful tool to address this crisis.
This paper proposes an automated evaluation framework to assess climate-crisis knowledge.
arXiv Detail & Related papers (2023-12-19T09:26:46Z)
- Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored Arabic LLM [77.17254959695218]
Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks.
We propose a light-weight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on a conversational-style instruction tuning Arabic dataset Clima500-Instruct.
Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
arXiv Detail & Related papers (2023-12-14T22:04:07Z)
- ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning [26.151056828513962]
Climate models have been key for assessing the impact of climate change and simulating future climate scenarios.
The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks.
Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives.
arXiv Detail & Related papers (2023-11-07T04:55:36Z)
- ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling [20.63843548201849]
ClimateLearn is an open-source library that vastly simplifies the training and evaluation of machine learning models for data-driven climate science.
It is the first large-scale, open-source effort for bridging research in weather and climate modeling with modern machine learning systems.
arXiv Detail & Related papers (2023-07-04T20:36:01Z)
- ClimaX: A foundation model for weather and climate [51.208269971019504]
ClimaX is a deep learning model for weather and climate science.
It can be pre-trained with a self-supervised learning objective on climate datasets.
It can be fine-tuned to address a breadth of climate and weather tasks.
arXiv Detail & Related papers (2023-01-24T23:19:01Z)
- Towards Answering Climate Questionnaires from Unstructured Climate Reports [26.036105166376284]
Activists and policymakers need NLP tools to process the vast and rapidly growing unstructured textual climate reports into structured form.
We introduce two new large-scale climate questionnaire datasets and use their existing structure to train self-supervised models.
We then use these models to help align texts from unstructured climate documents to the semi-structured questionnaires in a human pilot study.
arXiv Detail & Related papers (2023-01-11T00:22:56Z)
- Spatiotemporal modeling of European paleoclimate using doubly sparse Gaussian processes [61.31361524229248]
We build on recent sparse spatiotemporal GPs to reduce the computational burden.
We successfully employ such a doubly sparse GP to construct a probabilistic model of paleoclimate.
arXiv Detail & Related papers (2022-11-15T14:15:04Z)
- Climate-Invariant Machine Learning [0.8831201550856289]
Current climate models require representations of processes that occur at scales smaller than model grid size.
Recent machine learning (ML) algorithms hold promise to improve such process representations, but tend to extrapolate poorly to climate regimes they were not trained on.
We propose a new framework - termed "climate-invariant" ML - incorporating knowledge of climate processes into ML algorithms.
arXiv Detail & Related papers (2021-12-14T07:02:57Z)
- Analyzing Sustainability Reports Using Natural Language Processing [68.8204255655161]
In recent years, companies have increasingly been aiming to both mitigate their environmental impact and adapt to the changing climate context.
This is reported via increasingly exhaustive reports, which cover many types of climate risks and exposures under the umbrella of Environmental, Social, and Governance (ESG).
We present this tool and the methodology that we used to develop it in the present article.
arXiv Detail & Related papers (2020-11-03T21:22:42Z)
- HECT: High-Dimensional Ensemble Consistency Testing for Climate Models [1.7587442088965226]
Climate models play a crucial role in understanding the effect of environmental changes on climate to help mitigate climate risks and inform decisions.
Large global climate models, such as the Community Earth System Model (CESM), are very complex, with millions of lines of code describing interactions of the atmosphere, land, oceans, and ice.
Our work uses probabilistic classifiers like tree-based algorithms and deep neural networks to perform a statistically rigorous goodness-of-fit test of high-dimensional and man-made data.
arXiv Detail & Related papers (2020-10-08T15:16:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.