ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures
- URL: http://arxiv.org/abs/2406.09818v3
- Date: Tue, 01 Oct 2024 08:55:44 GMT
- Title: ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures
- Authors: Tobias Schimanski, Jingwei Ni, Roberto Spacey, Nicola Ranger, Markus Leippold
- Abstract summary: This work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions.
We obtain a dataset with over 8.5K unique question-source-answer pairs labeled with different levels of relevance.
We develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings.
- Score: 3.348779089844034
- Abstract: To handle the vast amounts of qualitative data produced in corporate climate communication, stakeholders increasingly rely on Retrieval Augmented Generation (RAG) systems. However, a significant gap remains in evaluating domain-specific information retrieval - the basis for answer generation. To address this challenge, this work simulates the typical tasks of a sustainability analyst by examining 30 sustainability reports with 16 detailed climate-related questions. As a result, we obtain a dataset with over 8.5K unique question-source-answer pairs labeled with different levels of relevance. Furthermore, we develop a use case with the dataset to investigate the integration of expert knowledge into information retrieval with embeddings. Although we show that incorporating expert knowledge works, we also outline the critical limitations of embeddings in knowledge-intensive downstream domains like climate change communication.
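To make the retrieval use case concrete, below is a minimal sketch of embedding-based passage retrieval in which an expert-written definition is appended to the query before encoding. The model name, helper names, and the query-enrichment strategy are illustrative assumptions, not the paper's exact method.

```python
# Sketch: embedding retrieval with expert knowledge folded into the query.
# All identifiers and the enrichment strategy are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works here

def retrieve(question, passages, expert_definition="", k=5):
    """Rank report passages against a question, optionally enriched with an
    expert-written definition of the concept being asked about."""
    query = f"{question} {expert_definition}".strip()
    q_emb = model.encode(query, convert_to_tensor=True)
    p_embs = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_embs)[0]            # cosine similarity per passage
    top = scores.topk(min(k, len(passages)))
    return [(passages[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

# Example: enrich a climate question with an expert definition before retrieval.
hits = retrieve(
    "Does the company use climate scenario analysis?",
    ["We apply the IEA NZE 2050 scenario to our portfolio.", "Revenue grew by 4% in 2022."],
    expert_definition="Scenario analysis assesses resilience under hypothetical climate pathways.",
)
```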
Related papers
- Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators [0.5755004576310334]
We interviewed 18 leading dataset creators about the current state of the field.
We shed light on the challenges and considerations faced by dataset creators.
We share seven central recommendations for improving responsible dataset creation.
arXiv Detail & Related papers (2024-08-30T20:52:19Z)
- Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models [0.0]
Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora.
High-quality datasets are used to train models on realistic scenarios.
Standardized metrics facilitate comparisons between different ODQA systems.
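Since such comparisons rest on standardized metrics, here is a minimal sketch of the two most common ODQA answer metrics, exact match and token-level F1, in the style of SQuAD-type evaluation; normalization details vary between benchmarks.

```python
# Sketch of exact match (EM) and token-level F1 for ODQA answers.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)                    # drop articles
    text = "".join(c for c in text if c not in string.punctuation)  # drop punctuation
    return " ".join(text.split())                                   # collapse whitespace

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def f1(prediction, gold):
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred) & Counter(ref)                           # overlapping tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

assert exact_match("The Eiffel Tower", "eiffel tower")
print(f1("Paris, France", "Paris"))  # -> 0.666...
```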
arXiv Detail & Related papers (2024-06-19T05:43:02Z)
- Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries [91.70689724416698]
We present NatQuest, a collection of 13,500 naturally occurring questions from three diverse sources.
Our analysis reveals a significant presence of causal questions (up to 42%) within the dataset.
arXiv Detail & Related papers (2024-05-30T17:55:28Z)
- Data Collection of Real-Life Knowledge Work in Context: The RLKWiC Dataset [4.388282062290401]
This paper presents RLKWiC, a novel dataset of Real-Life Knowledge Work in Context.
RLKWiC is the first publicly available dataset offering a wealth of essential information dimensions.
arXiv Detail & Related papers (2024-04-16T12:23:59Z)
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
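As a rough illustration of automatic QA generation for a tail entity, the sketch below prompts an LLM for question-answer pairs grounded in a passage. The openai client is one common choice; the prompt, model name, and parsing are assumptions rather than the paper's pipeline.

```python
# Illustrative sketch: generate QA pairs for a long-tail entity with an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(entity, passage, n=3):
    prompt = (
        f"Write {n} question-answer pairs about the entity '{entity}' that are "
        f"answerable only from this passage:\n{passage}\n"
        "Format each pair as 'Q: ...' on one line and 'A: ...' on the next."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    pairs = []
    for block in text.split("Q:")[1:]:              # naive parse of Q/A blocks
        question, _, answer = block.partition("A:")
        pairs.append({"question": question.strip(), "answer": answer.strip()})
    return pairs
```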
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data that carries potential reasoning procedures.
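The bootstrap-then-retrieve idea can be sketched as follows: an LLM expands seed terms into richer queries, which then pull documents from a public corpus, here via BM25. The `llm` helper is hypothetical and the retrieval backend is an assumption, not the paper's implementation.

```python
# Sketch: LLM-bootstrapped queries feeding BM25 retrieval over a corpus.
from rank_bm25 import BM25Okapi

def llm(prompt):
    """Hypothetical stand-in for any text-completion call."""
    raise NotImplementedError

def bootstrap_queries(seed_terms, per_seed=5):
    queries = []
    for term in seed_terms:
        out = llm(f"List {per_seed} search queries about '{term}', one per line.")
        queries += [q.strip() for q in out.splitlines() if q.strip()]
    return queries

def retrieve_from_corpus(queries, corpus, k=10):
    bm25 = BM25Okapi([doc.split() for doc in corpus])        # whitespace tokenization
    hits = set()
    for q in queries:
        hits.update(bm25.get_top_n(q.split(), corpus, n=k))  # top-k docs per query
    return hits
```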
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
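A toy version of the question-side perturbation: swapping a content word so the question no longer matches the image, which should push a well-calibrated model toward abstention. The word list and strategy are illustrative, not the UNK-VQA procedure.

```python
# Toy sketch of question perturbation for abstention probing.
import random

OBJECT_SWAPS = {"dog": "giraffe", "car": "submarine", "apple": "trombone"}  # assumed list

def perturb_question(question, seed=0):
    rng = random.Random(seed)
    words = question.split()
    candidates = [i for i, w in enumerate(words) if w.lower().strip("?") in OBJECT_SWAPS]
    if not candidates:
        return question                       # nothing swappable; leave unchanged
    i = rng.choice(candidates)
    suffix = "?" if words[i].endswith("?") else ""
    words[i] = OBJECT_SWAPS[words[i].lower().strip("?")] + suffix
    return " ".join(words)

print(perturb_question("What color is the dog?"))  # -> "What color is the giraffe?"
```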
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- ExpertQA: Expert-Curated Questions and Attributed Answers [51.68314045809179]
We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions.
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality.
The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present the Multi-Modal Reasoning (COCO-MMR) dataset, which encompasses an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
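The sentence-level contrastive component can be sketched as a standard InfoNCE-style objective over matched image/sentence embeddings; the shapes, temperature, and symmetric formulation are assumptions rather than the paper's exact loss.

```python
# Sketch: InfoNCE-style contrastive loss over aligned image/sentence pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of aligned image/sentence pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / tau                  # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))         # positives sit on the diagonal
    # symmetric loss: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```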
arXiv Detail & Related papers (2023-07-24T08:58:25Z)
- CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims [4.574830585715129]
We introduce CLIMATE-FEVER, a new dataset for verification of climate change-related claims.
We adapt the methodology of FEVER [1], the largest dataset of artificially designed claims, to real-life claims collected from the Internet.
We discuss the surprising, subtle complexity of modeling real-world climate-related claims within the FEVER framework.
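A minimal sketch of the FEVER-style verdict step: score a claim against one evidence sentence with an off-the-shelf NLI model and map its labels onto SUPPORTS / REFUTES / NOT_ENOUGH_INFO. The model choice and label mapping are assumptions, not the paper's setup.

```python
# Sketch: NLI-based claim verdict in the FEVER label scheme.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"  # assumed off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def verdict(evidence, claim):
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # bart-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    labels = ["REFUTES", "NOT_ENOUGH_INFO", "SUPPORTS"]
    return labels[int(probs.argmax())]

print(verdict("Global mean temperature rose about 1.1 °C since pre-industrial times.",
              "The planet has warmed since the industrial revolution."))
```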
arXiv Detail & Related papers (2020-12-01T16:32:54Z)