ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports
- URL: http://arxiv.org/abs/2511.16438v1
- Date: Thu, 20 Nov 2025 15:07:17 GMT
- Title: ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports
- Authors: Sherine George, Nithish Saji
- Abstract summary: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
Related papers
- GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
- Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance [21.31987959023507]
We introduce ESGAgent, a hierarchical multi-agent system empowered by a specialized toolset to generate in-depth ESG analysis. We present a benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common-sense questions to the generation of integrated, in-depth analysis.
arXiv Detail & Related papers (2026-01-13T15:58:29Z)
- LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation [12.341210252539776]
We introduce the LiveRAG benchmark, a dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities.
arXiv Detail & Related papers (2025-11-18T14:34:35Z)
- Knowledge-Graph Based RAG System Evaluation Framework [27.082302648704708]
Large language models (LLMs) have become a significant research focus. Retrieval Augmented Generation (RAG) greatly enhances generated content's reliability and relevance. Evaluating RAG systems remains a challenging task.
arXiv Detail & Related papers (2025-10-02T20:36:21Z)
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs [69.10441885629787]
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge. It falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective.
arXiv Detail & Related papers (2025-07-13T03:29:41Z)
- ESGenius: Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge [40.49917730563565]
ESGenius is a comprehensive benchmark for evaluating and enhancing the proficiency of Large Language Models (LLMs) in Environmental, Social, and Governance (ESG). ESGenius comprises two key components: (i) ESGenius-QA, a collection of 1,136 Multiple-Choice Questions (MCQs) generated by LLMs and rigorously validated by domain experts, covering a broad range of ESG pillars and sustainability topics; and (ii) ESGenius-Corpus, a meticulously curated repository of 231 foundational frameworks, standards, reports, and recommendation documents from 7 authoritative sources.
arXiv Detail & Related papers (2025-06-02T13:19:09Z)
- ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis [36.5158422340267]
We introduce ESGSenticNet, a knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation. Experiments indicate that ESGSenticNet, when deployed as a lexical method, more effectively captures relevant and actionable sustainability information.
arXiv Detail & Related papers (2025-01-27T01:21:12Z)
- Unanswerability Evaluation for Retrieval Augmented Generation [74.3022365715597]
UAEval4RAG is a framework designed to evaluate whether RAG systems can handle unanswerable queries effectively. We define a taxonomy with six unanswerable categories, and UAEval4RAG automatically synthesizes diverse and challenging queries.
arXiv Detail & Related papers (2024-12-16T19:11:55Z)
- An Adaptive Framework for Generating Systematic Explanatory Answer in Online Q&A Platforms [62.878616839799776]
We propose SynthRAG, an innovative framework designed to enhance Question Answering (QA) performance.
SynthRAG improves on conventional models by employing adaptive outlines for dynamic content structuring.
An online deployment on the Zhihu platform revealed that SynthRAG's answers achieved notable user engagement.
arXiv Detail & Related papers (2024-10-23T09:14:57Z)
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs).
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
- Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy [66.95501113584541]
We propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in Retrieval-Augmented Generation (RAG). RAG's three core components -- relevance ranking derived from retrieval models, utility judgments, and answer generation -- align with Schutz's philosophical system of relevances. Experimental results demonstrate significant improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.
arXiv Detail & Related papers (2024-06-17T07:52:42Z)
- Evaluation of Retrieval-Augmented Generation: A Survey [13.633909177683462]
We provide a comprehensive overview of the evaluation and benchmarks of Retrieval-Augmented Generation (RAG) systems.
Specifically, we examine and compare several quantifiable metrics of the Retrieval and Generation components, such as relevance, accuracy, and faithfulness.
We then analyze the various datasets and metrics, discuss the limitations of current benchmarks, and suggest potential directions to advance the field of RAG benchmarks.
arXiv Detail & Related papers (2024-05-13T02:33:25Z)
- OATS: Opinion Aspect Target Sentiment Quadruple Extraction Dataset for Aspect-Based Sentiment Analysis [55.61047894397937]
Aspect-based sentiment analysis (ABSA) delves into understanding sentiments specific to distinct elements within a user-generated review.
We introduce the OATS dataset, which encompasses three fresh domains and consists of 27,470 sentence-level quadruples and 17,092 review-level tuples.
Our initiative seeks to bridge specific observed gaps: the recurrent focus on familiar domains like restaurants and laptops, limited data for intricate quadruple extraction tasks, and an occasional oversight of the synergy between sentence and review-level sentiments.
arXiv Detail & Related papers (2023-09-23T07:39:16Z)