Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance
- URL: http://arxiv.org/abs/2601.08676v2
- Date: Wed, 14 Jan 2026 06:29:01 GMT
- Title: Advancing ESG Intelligence: An Expert-level Agent and Comprehensive Benchmark for Sustainable Finance
- Authors: Yilei Zhao, Wentao Zhang, Lei Xiao, Yandan Zheng, Mengpu Liu, Wei Yang Bryan Lim
- Abstract summary: We introduce ESGAgent, a hierarchical multi-agent system empowered by a specialized toolset to generate in-depth ESG analysis. We present a benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common-sense questions to the generation of integrated, in-depth analysis.
- Score: 21.31987959023507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environmental, social, and governance (ESG) criteria are essential for evaluating corporate sustainability and ethical performance. However, professional ESG analysis is hindered by data fragmentation across unstructured sources, and existing large language models (LLMs) often struggle with the complex, multi-step workflows required for rigorous auditing. To address these limitations, we introduce ESGAgent, a hierarchical multi-agent system empowered by a specialized toolset, including retrieval augmentation, web search and domain-specific functions, to generate in-depth ESG analysis. Complementing this agentic system, we present a comprehensive three-level benchmark derived from 310 corporate sustainability reports, designed to evaluate capabilities ranging from atomic common-sense questions to the generation of integrated, in-depth analysis. Empirical evaluations demonstrate that ESGAgent outperforms state-of-the-art closed-source LLMs with an average accuracy of 84.15% on atomic question-answering tasks, and excels in professional report generation by integrating rich charts and verifiable references. These findings confirm the diagnostic value of our benchmark, establishing it as a vital testbed for assessing general and advanced agentic capabilities in high-stakes vertical domains.
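To make the architecture in the abstract concrete, the following is a minimal Python sketch of a hierarchical agent system: a planner decomposes an ESG question across pillar-specific workers that share a small toolset. Everything below is an illustrative assumption, not ESGAgent's actual implementation: the `PlannerAgent`/`WorkerAgent` split, the tool registry, and the `carbon_intensity` helper merely stand in for the retrieval augmentation, web search, and domain-specific functions the abstract names.

```python
from dataclasses import dataclass
from typing import Callable

# --- Toolset stubs. The abstract names retrieval augmentation, web
# --- search, and domain-specific functions; these stand in for them.
def retrieve(query: str) -> str:
    """Stub: look up passages from indexed sustainability reports."""
    return f"[retrieved passages for: {query}]"

def web_search(query: str) -> str:
    """Stub: fetch fresh external evidence from the open web."""
    return f"[web results for: {query}]"

def carbon_intensity(revenue_musd: float, emissions_tco2e: float) -> float:
    """Hypothetical domain-specific tool: tCO2e per $M of revenue."""
    return emissions_tco2e / revenue_musd

TOOLS: dict[str, Callable[..., object]] = {
    "retrieve": retrieve,
    "web_search": web_search,
    "carbon_intensity": carbon_intensity,
}

PILLARS = ("environmental", "social", "governance")

@dataclass
class WorkerAgent:
    """Leaf agent: answers one sub-task using the shared toolset."""
    name: str

    def run(self, subtask: str) -> str:
        evidence = TOOLS["retrieve"](subtask)   # ground in report text
        extra = TOOLS["web_search"](subtask)    # cross-check externally
        return f"{self.name}: {subtask} | {evidence} | {extra}"

@dataclass
class PlannerAgent:
    """Root agent: decomposes a question and merges worker findings."""
    workers: dict

    def run(self, question: str) -> str:
        # A real planner would be LLM-driven; keyword routing stands in.
        pillars = [p for p in PILLARS if p in question.lower()] or ["environmental"]
        findings = [self.workers[p].run(f"{p} analysis of: {question}")
                    for p in pillars]
        return "\n".join(findings)  # synthesis would also cite sources

if __name__ == "__main__":
    planner = PlannerAgent(workers={p: WorkerAgent(name=p) for p in PILLARS})
    print(planner.run("Assess the environmental and governance risks."))
    # Domain-specific tool demo: 5,400 tCO2e over $1,200M revenue -> 4.5.
    print("intensity:", TOOLS["carbon_intensity"](1200.0, 5400.0))
```

In the real system the planner and workers would be LLM-driven, and the synthesis step would attach the charts and verifiable references the abstract describes; the keyword routing here only stands in for that planning loop.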
Related papers
- GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves an exact-match score of only 19.30%.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
- LongDA: Benchmarking LLM Agents for Long-Document Data Analysis [55.32211515932351]
LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. LongTA is a tool-augmented agent framework that enables document access, retrieval, and code execution (a toy sketch of such a loop follows this entry). Our experiments reveal substantial performance gaps even among state-of-the-art models.
arXiv Detail & Related papers (2026-01-05T23:23:16Z)
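Purely to make "tool-augmented" concrete, here is a schematic loop under assumed interfaces; LongTA's real API is not described in the summary above, so `open_document`, `search_pages`, and `run_code` are hypothetical names.

```python
import contextlib
import io

def open_document(path: str) -> list[str]:
    """Stub document access: return the document as a list of pages."""
    return [f"page {i} of {path}" for i in range(1, 4)]

def search_pages(pages: list[str], query: str) -> list[str]:
    """Stub retrieval: keep pages mentioning the query, else page one."""
    hits = [p for p in pages if query.lower() in p.lower()]
    return hits or pages[:1]

def run_code(snippet: str) -> str:
    """Stub code execution: capture stdout of a Python snippet."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})  # a real agent would sandbox this step
    return buf.getvalue().strip()

pages = open_document("annual_report.pdf")
evidence = search_pages(pages, "page 2")      # retrieval over pages
answer = run_code("print(sum([1200, 980]))")  # analysis via code
print(evidence, "->", answer)
```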
arXiv Detail & Related papers (2026-01-05T23:23:16Z) - DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports [49.217247659479476]
Deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis. Existing benchmarks often lack systematic criteria for expert reporting. We introduce DEER, a benchmark for evaluating expert-level deep research reports.
arXiv Detail & Related papers (2025-12-19T16:46:20Z)
- CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency [60.83660377169452]
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges.
arXiv Detail & Related papers (2025-11-29T09:52:34Z)
- Benchmarking LLM-based Agents for Single-cell Omics Analysis [6.915378212190715]
AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion. We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis.
arXiv Detail & Related papers (2025-08-16T04:26:18Z)
- Optimizing Large Language Models for ESG Activity Detection in Financial Texts [0.7373617024876725]
This paper investigates the ability of current-generation Large Language Models to identify text related to environmental activities. We introduce ESG-Activities, a benchmark dataset containing 1,325 labelled text segments classified according to the EU ESG taxonomy. Our experimental results show that fine-tuning on ESG-Activities significantly enhances classification accuracy.
arXiv Detail & Related papers (2025-02-28T14:52:25Z)
- FinSphere, a Real-Time Stock Analysis Agent Powered by Instruction-Tuned LLMs and Domain Tools [7.6993069412225905]
Current financial large language models (FinLLMs) face two critical limitations: the absence of objective evaluation metrics to assess the quality of stock analysis reports, and a lack of analytical depth that impedes their ability to generate professional-grade insights. This paper introduces FinSphere, a stock analysis agent, along with three major contributions.
arXiv Detail & Related papers (2025-01-08T07:50:50Z)
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation [79.09622602860703]
We introduce InsightBench, a benchmark dataset with three key features. It consists of 100 datasets representing diverse business use cases such as finance and incident management. Unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics.
arXiv Detail & Related papers (2024-07-08T22:06:09Z)
- Can a GPT4-Powered AI Agent Be a Good Enough Performance Attribution Analyst? [0.0]
This study introduces an AI agent for a variety of essential performance attribution tasks.
It achieves accuracy rates exceeding 93% in analyzing performance drivers, attains 100% in multi-level attribution calculations, and surpasses 84% accuracy in QA exercises that simulate official examination standards.
arXiv Detail & Related papers (2024-03-15T17:12:57Z)
- AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements (a toy sketch of such a metric follows this entry), as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)
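The progress rate metric lends itself to a small worked example. Its exact definition is not given in the summary above, so the sketch below assumes the simplest plausible form: the fraction of annotated subgoals an agent has completed, which credits partial progress that a binary success flag would miss.

```python
def progress_rate(subgoals_met: list[bool]) -> float:
    """Assumed form of a progress-rate metric: fraction of subgoals met.

    AgentBoard's actual definition is not given in the summary above;
    this is only the simplest plausible reading of a "fine-grained
    progress rate".
    """
    if not subgoals_met:
        return 0.0
    return sum(subgoals_met) / len(subgoals_met)

# A run that clears 3 of 4 subgoals scores 0.75, where a binary
# success metric would record 0.0 for the same trajectory.
print(progress_rate([True, True, True, False]))  # -> 0.75
```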