chatClimate: Grounding Conversational AI in Climate Science
- URL: http://arxiv.org/abs/2304.05510v2
- Date: Fri, 28 Apr 2023 15:07:41 GMT
- Title: chatClimate: Grounding Conversational AI in Climate Science
- Authors: Saeid Ashraf Vaghefi, Qian Wang, Veruska Muccione, Jingwei Ni, Mathias
Kraus, Julia Bingler, Tobias Schimanski, Chiara Colesanti-Senni, Nicolas
Webersinke, Christian Huggel, Markus Leippold
- Abstract summary: Large Language Models (LLMs) still face two major challenges: hallucination and outdated information after the training phase.
We present our conversational AI prototype, available at www.chatclimate.ai, and demonstrate its ability to answer challenging questions accurately.
The answers and their sources were evaluated by our team of IPCC authors, who used their expert knowledge to score the accuracy of the answers from 1 (very-low) to 5 (very-high).
- Score: 9.043032065867536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have made significant progress in recent years,
achieving remarkable results in question-answering tasks (QA). However, they
still face two major challenges: hallucination and outdated information after
the training phase. These challenges take center stage in critical domains like
climate change, where obtaining accurate and up-to-date information from
reliable sources in a limited time is essential and difficult. To overcome
these barriers, one potential solution is to provide LLMs with access to
external, scientifically accurate, and robust sources (long-term memory) to
continuously update their knowledge and prevent the propagation of inaccurate,
incorrect, or outdated information. In this study, we enhanced GPT-4 by
integrating the information from the Sixth Assessment Report of the
Intergovernmental Panel on Climate Change (IPCC AR6), the most comprehensive, up-to-date, and reliable
source in this domain. We present our conversational AI prototype, available at
www.chatclimate.ai, and demonstrate its ability to answer challenging questions
accurately in three different QA scenarios: asking from 1) GPT-4, 2)
chatClimate, and 3) hybrid chatClimate. The answers and their sources were
evaluated by our team of IPCC authors, who used their expert knowledge to score
the accuracy of the answers from 1 (very-low) to 5 (very-high). The evaluation
showed that the hybrid chatClimate provided more accurate answers, highlighting
the effectiveness of our solution. This approach can be easily scaled for
chatbots in specific domains, enabling the delivery of reliable and accurate
information.
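The long-term-memory approach the abstract describes can be sketched as a simple retrieve-then-prompt loop: index report passages, retrieve the best matches for a question, and build a grounded prompt for the LLM. The sketch below is illustrative only; the function names, the toy corpus, and the lexical-overlap scoring (standing in for embedding similarity over IPCC AR6 and a GPT-4 call) are assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of retrieval-augmented QA: retrieve passages relevant to a
# question, then assemble a prompt that asks the model to answer only from
# those cited sources. The scoring here is crude lexical overlap; a real
# system would use vector embeddings of the report chunks.
from collections import Counter
import math


def tokenize(text):
    return [t.lower().strip(".,?") for t in text.split()]


def score(question, passage):
    """Lexical-overlap score standing in for embedding similarity."""
    q, p = Counter(tokenize(question)), Counter(tokenize(passage))
    return sum((q & p).values()) / math.sqrt(len(p) + 1)


def retrieve(question, passages, k=2):
    """Return the k passages that best match the question."""
    return sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]


def build_prompt(question, passages):
    """Number the retrieved passages so the model can cite them."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\nQuestion: {question}"
    )


# Toy two-passage corpus: one on-topic, one distractor.
passages = [
    "Global surface temperature was 1.09 C higher in 2011-2020 than 1850-1900.",
    "Football is played by two teams of eleven players.",
]
question = "How much has global surface temperature risen?"
print(build_prompt(question, retrieve(question, passages, k=1)))
```

The grounded prompt is then sent to the chat model; answering only from retrieved, citable passages is what counters both hallucination and post-training staleness.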
Related papers
- AI-Enabled grading with near-domain data for scaling feedback with human-level accuracy [0.5735035463793009]
This paper proposes a novel and practical approach to grade short-answer constructed-response questions. Our framework does not require pre-written grading rubrics and is designed explicitly with practical classroom settings in mind.
arXiv Detail & Related papers (2025-12-01T05:11:37Z) - ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning [118.46980291324148]
ATLAS is a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include high originality and contamination resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
arXiv Detail & Related papers (2025-11-18T11:13:06Z) - Assessing Web Search Credibility and Response Groundedness in Chat Assistants [4.0127354590894955]
We introduce a novel methodology for evaluating assistants' web search behavior. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat.
arXiv Detail & Related papers (2025-10-15T16:55:47Z) - A Self-Evolving AI Agent System for Climate Science [59.08800209508371]
We introduce EarthLink, the first self-evolving AI agent system designed as an interactive "copilot" for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning. It exhibits human-like cross-disciplinary analytical ability and proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks.
arXiv Detail & Related papers (2025-07-23T08:29:25Z) - Are Frontier Large Language Models Suitable for Q&A in Science Centres? [0.4326762849037007]
This paper investigates the suitability of frontier Large Language Models (LLMs) for Q&A interactions in science centres.
We evaluated responses generated by three leading models: OpenAI's GPT-4, Claude 3.5 Sonnet, and Google Gemini 1.5.
The results revealed a trade-off between creativity and accuracy, with Claude outperforming GPT and Gemini in both maintaining clarity and engaging young audiences.
arXiv Detail & Related papers (2024-12-06T17:28:43Z) - Synergizing LLMs and Knowledge Graphs: A Novel Approach to Software Repository-Related Question Answering [3.076436880934678]
Software repositories contain valuable information for gaining insights into their development process.
However, extracting insights from these repository data is time-consuming and requires technical expertise.
This study aims to improve the accuracy of LLM-based chatbots in answering repository-related questions by augmenting them with knowledge graphs.
arXiv Detail & Related papers (2024-12-05T02:18:03Z) - Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations [3.3018718917393297]
We propose the novel task of Question Answering with source citation in ambiguous settings, where multiple valid answers exist.
We create a comprehensive framework consisting of: (1) five novel datasets; (2) the first ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; and (3) two new metrics to evaluate models' performances.
We hope that this new task, datasets, metrics, and baselines will inspire the community to push the boundaries of QA research and develop more trustworthy and interpretable systems.
arXiv Detail & Related papers (2024-10-05T17:37:01Z) - Crowd Intelligence for Early Misinformation Prediction on Social Media [29.494819549803772]
We introduce CROWDSHIELD, a crowd intelligence-based method for early misinformation prediction.
We employ Q-learning to capture the two dimensions -- stances and claims.
We propose MIST, a manually annotated misinformation detection Twitter corpus.
arXiv Detail & Related papers (2024-08-08T13:45:23Z) - Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries [91.70689724416698]
We present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources.
Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset.
arXiv Detail & Related papers (2024-05-30T17:55:28Z) - The Battle of LLMs: A Comparative Study in Conversational QA Tasks [0.0]
This research delves into the responses generated by ChatGPT, GPT-4, Gemini, Mixtral and Claude across different Conversational QA corpora.
Evaluation scores were meticulously computed and subsequently compared to ascertain the overall performance of these models.
arXiv Detail & Related papers (2024-05-28T16:42:43Z) - InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification [60.10193972862099]
This work proposes a framework to characterize and recover simplification-induced information loss in form of question-and-answer pairs.
QA pairs are designed to help readers deepen their knowledge of a text.
arXiv Detail & Related papers (2024-01-29T19:00:01Z) - The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z) - Learning to Break: Knowledge-Enhanced Reasoning in Multi-Agent Debate System [16.830182915504555]
Multi-agent debate system (MAD) imitates the process of human discussion in pursuit of truth.
It is challenging to make various agents perform right and highly consistent cognition due to their limited and different knowledge backgrounds.
We propose a novel Multi-Agent Debate with Knowledge-Enhanced framework to promote the system to find the solution.
arXiv Detail & Related papers (2023-12-08T06:22:12Z) - Competition-Level Problems are Effective LLM Evaluators [121.15880285283116]
This paper aims to evaluate the reasoning capacities of large language models (LLMs) in solving recent programming problems in Codeforces.
We first provide a comprehensive evaluation of GPT-4's zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered.
Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems.
arXiv Detail & Related papers (2023-12-04T18:58:57Z) - ChatGPT versus Traditional Question Answering for Knowledge Graphs:
Current Status and Future Directions Towards Knowledge Graph Chatbots [7.2676028986202]
Conversational AI and Question-Answering systems (QASs) for knowledge graphs (KGs) are both emerging research areas.
QASs retrieve the most recent information from a KG by understanding and translating the natural language question into a formal query supported by the database engine.
Our framework compares two representative conversational models, ChatGPT and Galactica, against KGQAN, the current state-of-the-art QAS.
arXiv Detail & Related papers (2023-02-08T13:03:27Z) - A Survey for Efficient Open Domain Question Answering [51.67110249787223]
Open domain question answering (ODQA) is a longstanding task in natural language processing (NLP) aimed at answering factual questions from a large knowledge corpus without any explicit evidence.
arXiv Detail & Related papers (2022-11-15T04:18:53Z) - RealTime QA: What's the Answer Right Now? [137.04039209995932]
We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis.
We build strong baseline models upon large pretrained language models, including GPT-3 and T5.
GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer.
arXiv Detail & Related papers (2022-07-27T07:26:01Z) - Logic-Guided Data Augmentation and Regularization for Consistent
Question Answering [55.05667583529711]
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions.
Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
arXiv Detail & Related papers (2020-04-21T17:03:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.