ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate
Statements?
- URL: http://arxiv.org/abs/2311.17107v1
- Date: Tue, 28 Nov 2023 10:26:57 GMT
- Title: ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate
Statements?
- Authors: Romain Lacombe, Kerrie Wu, Eddie Dilworth
- Abstract summary: We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements.
Using this dataset, we show that recent Large Language Models (LLMs) can classify human expert confidence in climate-related statements.
Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the accuracy of outputs generated by Large Language Models (LLMs)
is especially important in the climate science and policy domain. We introduce
the Expert Confidence in Climate Statements (ClimateX) dataset, a novel,
curated, expert-labeled dataset consisting of 8094 climate statements collected
from the latest Intergovernmental Panel on Climate Change (IPCC) reports,
labeled with their associated confidence levels. Using this dataset, we show
that recent LLMs can classify human expert confidence in climate-related
statements, especially in a few-shot learning setting, but with limited (up to
47%) accuracy. Overall, models exhibit consistent and significant
over-confidence on low and medium confidence statements. We highlight
implications of our results for climate communication, LLM evaluation
strategies, and the use of LLMs in information retrieval systems.
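The abstract reports two quantities: classification accuracy and over-confidence on low and medium confidence statements. The sketch below shows one way such numbers could be computed. It is not the authors' evaluation code. The four-level ordinal scale, the `evaluate` helper, and the toy labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed ordinal confidence scale (hypothetical; the abstract only names
# "low" and "medium" explicitly).
LEVELS = ["low", "medium", "high", "very high"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def evaluate(gold, pred):
    """Return (accuracy, bias) for parallel lists of confidence labels.

    bias maps each gold level to the mean signed rank error
    (predicted rank minus gold rank). A positive value means the
    predictions sit above the expert label, i.e. over-confidence.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")

    correct = sum(g == p for g, p in zip(gold, pred))

    errors = defaultdict(list)
    for g, p in zip(gold, pred):
        errors[g].append(RANK[p] - RANK[g])

    bias = {level: sum(v) / len(v) for level, v in errors.items()}
    return correct / len(gold), bias


# Toy labels, made up for this example (not taken from the dataset).
gold = ["low", "low", "medium", "medium", "high", "very high"]
pred = ["medium", "high", "high", "medium", "high", "very high"]

accuracy, bias = evaluate(gold, pred)
print(f"accuracy = {accuracy:.2f}")
for level in LEVELS:
    print(f"{level:>9}: mean signed error = {bias[level]:+.2f}")
```

With these toy labels, the positive mean signed error on the low and medium rows is the pattern the abstract describes as over-confidence; real results would depend on the dataset and model outputs.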
Related papers
- ClimaQA: An Automated Evaluation Framework for Climate Foundation Models [38.05357439484919]
We develop ClimaGen, an automated framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop.
We present ClimaQA-Gold, an expert-annotated benchmark dataset alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science.
arXiv Detail & Related papers (2024-10-22T05:12:19Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- Unlearning Climate Misinformation in Large Language Models [17.95497650321137]
Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity.
This paper investigates factual accuracy in large language models (LLMs) regarding climate information.
arXiv Detail & Related papers (2024-05-29T23:11:53Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs).
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z)
- Climate Change from Large Language Models [7.190384101545232]
Climate change poses grave challenges, demanding widespread understanding and low-carbon lifestyle awareness.
Large language models (LLMs) offer a powerful tool to address this crisis.
This paper proposes an automated evaluation framework to assess climate-crisis knowledge.
arXiv Detail & Related papers (2023-12-19T09:26:46Z)
- Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM [77.17254959695218]
Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks.
We propose a light-weight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on a conversational-style instruction tuning Arabic dataset Clima500-Instruct.
Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
arXiv Detail & Related papers (2023-12-14T22:04:07Z)
- Enhancing Large Language Models with Climate Resources [5.2677629053588895]
Large language models (LLMs) have transformed the landscape of artificial intelligence by demonstrating their ability in generating human-like text.
However, they often employ imprecise language, which can be detrimental in domains where accuracy is crucial, such as climate change.
In this study, we make use of recent ideas to harness the potential of LLMs by viewing them as agents that access multiple sources.
We demonstrate the effectiveness of our method through a prototype agent that retrieves emission data from ClimateWatch.
arXiv Detail & Related papers (2023-03-31T20:24:14Z)
- CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims [4.574830585715129]
We introduce CLIMATE-FEVER, a new dataset for verification of climate change-related claims.
We adapt the methodology of FEVER [1], the largest dataset of artificially designed claims, to real-life claims collected from the Internet.
We discuss the surprising, subtle complexity of modeling real-world climate-related claims within the FEVER framework.
arXiv Detail & Related papers (2020-12-01T16:32:54Z)
- Analyzing Sustainability Reports Using Natural Language Processing [68.8204255655161]
In recent years, companies have increasingly been aiming to both mitigate their environmental impact and adapt to the changing climate context.
This is reported via increasingly exhaustive reports, which cover many types of climate risks and exposures under the umbrella of Environmental, Social, and Governance (ESG).
We present this tool and the methodology that we used to develop it in the present article.
arXiv Detail & Related papers (2020-11-03T21:22:42Z)