CLINB: A Climate Intelligence Benchmark for Foundational Models
- URL: http://arxiv.org/abs/2511.11597v1
- Date: Wed, 29 Oct 2025 16:15:42 GMT
- Title: CLINB: A Climate Intelligence Benchmark for Foundational Models
- Authors: Michelle Chen Huebscher, Katharine Mach, Aleksandar Stanić, Markus Leippold, Ben Gaiarin, Zeke Hausfather, Elisa Rawat, Erich Fischer, Massimiliano Ciaramita, Joeri Rogelj, Christian Buck, Lierni Sestorain Saralegui, Reto Knutti,
- Abstract summary: We introduce CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks. We implement and validate a model-based evaluation process and evaluate several frontier models.
- Score: 31.884362929125363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating how Large Language Models (LLMs) handle complex, specialized knowledge remains a critical challenge. We address this through the lens of climate change by introducing CLINB, a benchmark that assesses models on open-ended, grounded, multimodal question answering tasks with clear requirements for knowledge quality and evidential support. CLINB relies on a dataset of real users' questions and evaluation rubrics curated by leading climate scientists. We implement and validate a model-based evaluation process and evaluate several frontier models. Our findings reveal a critical dichotomy. Frontier models demonstrate remarkable knowledge synthesis capabilities, often exhibiting PhD-level understanding and presentation quality. They outperform "hybrid" answers curated by domain experts assisted by weaker models. However, this performance is countered by failures in grounding. The quality of evidence varies, with substantial hallucination rates for references and images. We argue that bridging this gap between knowledge synthesis and verifiable attribution is essential for the deployment of AI in scientific workflows and that reliable, interpretable benchmarks like CLINB are needed to progress towards building trustworthy AI systems.
Related papers
- Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems [94.9141394384021]
Individual agents in multi-agent systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and an inadequate ability to evaluate peer reliability. We first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input. We then develop Epistemic Context Learning (ECL), a reasoning framework that conditions predictions on explicitly built peer profiles from history.
arXiv Detail & Related papers (2026-01-29T13:59:32Z) - CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering [10.31170458584116]
We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS). CAIRNS is a framework that enables experts to obtain credible preliminary answers from complex web-based evidence sources. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation.
arXiv Detail & Related papers (2025-12-01T22:44:43Z) - ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning [118.46980291324148]
ATLAS is a large-scale, high-difficulty, cross-disciplinary evaluation suite composed of approximately 800 original problems. Its key features include high originality and contamination resistance, with all questions newly created or substantially adapted to prevent test data leakage. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities.
arXiv Detail & Related papers (2025-11-18T11:13:06Z) - Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z) - KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues [58.305425399644086]
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. We introduce KnowMT-Bench, the first-ever benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields.
arXiv Detail & Related papers (2025-09-26T04:32:29Z) - ClimaQA: An Automated Evaluation Framework for Climate Question Answering Models [38.05357439484919]
We develop ClimaGen, an adaptive learning framework that generates question-answer pairs from graduate textbooks with climate scientists in the loop. We present ClimaQA-Gold, an expert-annotated benchmark dataset, alongside ClimaQA-Silver, a large-scale, comprehensive synthetic QA dataset for climate science.
arXiv Detail & Related papers (2024-10-22T05:12:19Z) - Learning to Generate and Evaluate Fact-checking Explanations with Transformers [10.970249299147866]
This research contributes to the field of Explainable Artificial Intelligence (XAI).
We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations.
We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
arXiv Detail & Related papers (2024-10-21T06:22:51Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z) - What Does My QA Model Know? Devising Controlled Probes using Expert
Knowledge [36.13528043657398]
We investigate whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning.
We use a methodology for automatically building datasets from various types of expert knowledge.
Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge.
arXiv Detail & Related papers (2019-12-31T15:05:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.