Related papers: ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

URL: http://arxiv.org/abs/2311.17107v1
Date: Tue, 28 Nov 2023 10:26:57 GMT
Title: ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?
Authors: Romain Lacombe, Kerrie Wu, Eddie Dilworth
Abstract summary: We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements. Using this dataset, we show that recent Large Language Models (LLMs) can classify human expert confidence in climate-related statements. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLMs evaluation strategies, and the use of LLMs in information retrieval systems.

Related papers

ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method [61.76389719956301]
We contribute a multi-modal climate benchmark, i.e., ClimateBench-M, which aligns time series climate data from ERA5, extreme weather events data from NOAA, and satellite image data from NASA. Under each data modality, we also propose a simple but strong generative method that could produce competitive performance in weather forecasting, thunderstorm alerts, and crop segmentation tasks.
arXiv Detail & Related papers (2025-04-10T02:22:23Z)
CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ) [14.065907685322097]
CliME is a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. We present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis.
arXiv Detail & Related papers (2025-04-04T20:01:00Z)
Enhancing LLMs for Governance with Human Oversight: Evaluating and Aligning LLMs on Expert Classification of Climate Misinformation for Detecting False or Misleading Claims about Climate Change [0.0]
Climate misinformation is a problem that has the potential to be substantially aggravated by the development of Large Language Models (LLMs) In this study we evaluate the potential for LLMs to be part of the solution for mitigating online dis/misinformation rather than the problem.
arXiv Detail & Related papers (2025-01-23T16:21:15Z)
Exploring Large Language Models for Climate Forecasting [5.25781442142288]
Large language models (LLMs) present a promising approach to bridging the gap between complex climate data and the general public. This study investigates the capability of GPT-4 in predicting rainfall at short-term (15-day) and long-term (12-month) scales.
arXiv Detail & Related papers (2024-11-20T21:58:19Z)
ClimaQA: An Automated Evaluation Framework for Climate Foundation Models [38.05357439484919]
We develop ClimaGen, an automated framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. We present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science.
arXiv Detail & Related papers (2024-10-22T05:12:19Z)
LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis. Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs. Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
Unlearning Climate Misinformation in Large Language Models [17.95497650321137]
Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information.
arXiv Detail & Related papers (2024-05-29T23:11:53Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models. It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLMs response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z)
Climate Change from Large Language Models [7.190384101545232]
Climate change poses grave challenges, demanding widespread understanding and low-carbon lifestyle awareness. Large language models (LLMs) offer a powerful tool to address this crisis. This paper proposes an automated evaluation framework to assess climate-crisis knowledge.
arXiv Detail & Related papers (2023-12-19T09:26:46Z)
Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored Arabic LLM [77.17254959695218]
Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks. We propose a light-weight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on a conversational-style instruction tuning Arabic dataset Clima500-Instruct. Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
arXiv Detail & Related papers (2023-12-14T22:04:07Z)
Enhancing Large Language Models with Climate Resources [5.2677629053588895]
Large language models (LLMs) have transformed the landscape of artificial intelligence by demonstrating their ability in generating human-like text. However, they often employ imprecise language, which can be detrimental in domains where accuracy is crucial, such as climate change. In this study, we make use of recent ideas to harness the potential of LLMs by viewing them as agents that access multiple sources. We demonstrate the effectiveness of our method through a prototype agent that retrieves emission data from ClimateWatch.
arXiv Detail & Related papers (2023-03-31T20:24:14Z)
Analyzing Sustainability Reports Using Natural Language Processing [68.8204255655161]
In recent years, companies have increasingly been aiming to both mitigate their environmental impact and adapt to the changing climate context. This is reported via increasingly exhaustive reports, which cover many types of climate risks and exposures under the umbrella of Environmental, Social, and Governance (ESG) We present this tool and the methodology that we used to develop it in the present article.
arXiv Detail & Related papers (2020-11-03T21:22:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.