Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs
- URL: http://arxiv.org/abs/2512.24289v1
- Date: Tue, 30 Dec 2025 15:28:03 GMT
- Title: Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs
- Authors: Jonathan Schmoll, Adam Jatowt
- Abstract summary: Large Language Models (LLMs) offer a path to automation. We introduce a novel, structured dataset from 190 corporate reports. Our results reveal a clear performance gap between qualitative and quantitative tasks.
- Score: 21.656551146954587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also uncover a paradox: concise metadata often yields better performance than full, unstructured reports. We further find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.
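The two quantitative findings above, zero-shot KPI prediction failure and poor confidence calibration, map onto a concrete evaluation loop. Below is a minimal sketch of such a loop, not the authors' actual pipeline: the `call_llm` helper, the prompt wording, and the `taxonomy_aligned_revenue_pct` field name are all illustrative assumptions.

```python
import json
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client; returns a
    fixed response here so the sketch runs end to end."""
    return '{"taxonomy_aligned_revenue_pct": 12.5, "confidence": 0.9}'

PROMPT = (
    "From the sustainability report excerpt below, return JSON with "
    '"taxonomy_aligned_revenue_pct" (a number) and "confidence" (0 to 1).\n\n'
    "Excerpt:\n{excerpt}"
)

def predict_kpi(excerpt: str):
    raw = call_llm(PROMPT.format(excerpt=excerpt))
    match = re.search(r"\{.*\}", raw, re.S)  # tolerate prose around the JSON
    return json.loads(match.group(0)) if match else None

def expected_calibration_error(results, tolerance=1.0, n_bins=5):
    """results: list of (confidence, prediction, ground_truth) triples.
    Bins predictions by stated confidence and compares average confidence
    in each bin with the fraction within `tolerance` of the truth."""
    bins = [[] for _ in range(n_bins)]
    for conf, pred, truth in results:
        bins[min(int(conf * n_bins), n_bins - 1)].append(
            (conf, abs(pred - truth) <= tolerance)
        )
    total = len(results)
    return sum(
        (len(b) / total)
        * abs(sum(c for c, _ in b) / len(b) - sum(ok for _, ok in b) / len(b))
        for b in bins if b
    )

pred = predict_kpi("Of our 2023 revenue, 12.5% was Taxonomy-aligned ...")
print(expected_calibration_error(
    [(pred["confidence"], pred["taxonomy_aligned_revenue_pct"], 12.5)]
))
```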
Related papers
- Benchmarking LLM Agents for Wealth-Management Workflows [0.0]
This dissertation extends TheAgentCompany with a finance-focused environment.
It investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically.
arXiv Detail & Related papers (2025-12-01T21:56:21Z)
- How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes [5.848712585343904]
This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand.
Our benchmark combines two datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption.
Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends; illustrative versions of the two prompt styles are sketched after the citation below.
arXiv Detail & Related papers (2025-10-27T14:08:27Z)
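To make the contrast concrete, here are illustrative versions of the two prompt styles; neither template is taken from the paper, and the placeholder fields are assumptions.

```python
# Illustrative prompt templates (not the paper's actual prompts).
STRUCTURED_TASK_PROMPT = (
    "Task: forecast the change in job postings for the sector below.\n"
    "Sector: {sector}\n"
    "Horizon: next {months} months\n"
    "Recent postings index: {history}\n"
    "Answer with a single signed percentage."
)

PERSONA_PROMPT = (
    "You are a labor economist who has tracked {sector} hiring for a decade. "
    "Given the recent postings index {history}, how do you expect demand to "
    "move over the next {months} months? Answer with a signed percentage."
)

for template in (STRUCTURED_TASK_PROMPT, PERSONA_PROMPT):
    print(template.format(sector="logistics", months=6,
                          history=[101.2, 99.8, 97.5]))
```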
- A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs [0.0]
This paper presents a framework for benchmarking Large Language Models (LLMs) on the task of classifying complex industrial records.
To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool.
We quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores.
arXiv Detail & Related papers (2025-09-08T15:48:17Z)
- Quantifying Qualitative Insights: Leveraging LLMs to Market Predict [0.0]
This study addresses challenges by leveraging daily reports from securities firms to create high-quality contextual information.
The reports are segmented into text-based key factors and combined with numerical data, such as price information, to form context sets.
A crafted prompt is designed to assign scores to the key factors, converting qualitative insights into quantitative results; a sketch of this scoring step follows the citation below.
arXiv Detail & Related papers (2024-11-13T07:45:40Z)
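A minimal sketch of the scoring step described above, assuming a hypothetical `call_llm` client and an illustrative 1-to-5 scale; the paper's actual prompt design and scale may differ.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical client, stubbed so the sketch runs offline.
    return '{"earnings momentum": 4, "regulatory risk": 2}'

SCORING_PROMPT = (
    "Rate each key factor from 1 (very bearish) to 5 (very bullish) "
    "for the stock's outlook. Return JSON mapping factor to score.\n"
    "Key factors: {factors}\nPrice context: {prices}"
)

def score_factors(factors, prices):
    raw = call_llm(SCORING_PROMPT.format(factors=factors, prices=prices))
    return json.loads(raw)  # qualitative text is now numeric

print(score_factors(["earnings momentum", "regulatory risk"],
                    {"close": [51.2, 52.8, 54.1]}))
```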
- Evaluating Large Language Models on Financial Report Summarization: An Empirical Study [9.28042182186057]
We conduct a comparative study of three state-of-the-art Large Language Models (LLMs).
Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information.
We introduce an innovative evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality; a sketch of such a combined evaluation follows the citation below.
arXiv Detail & Related papers (2024-11-11T10:36:04Z)
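A rough illustration of mixing quantitative metrics with qualitative analyses in the spirit of the framework above; the rubric prompt and the stubbed `call_llm` judge are assumptions, not the paper's implementation.

```python
def precision_recall(predicted: set, gold: set):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def call_llm(prompt: str) -> str:
    return "4"  # hypothetical judge model, stubbed

def rubric_score(summary: str, criterion: str) -> int:
    # Qualitative axis, e.g. "contextual fit" or "consistency", rated 1-5.
    return int(call_llm(f"Rate this summary 1-5 on {criterion}:\n{summary}"))

p, r = precision_recall({"revenue up 8%", "margin flat"},
                        {"revenue up 8%", "debt reduced"})
print(p, r, rubric_score("Revenue rose 8% while margins held flat.",
                         "contextual fit"))
```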
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: the label smoothing value used in training is set adaptively according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning; a sketch of the adaptive smoothing idea follows the citation below.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
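A minimal PyTorch sketch of the adaptive label-smoothing idea described above; the linear mapping from a sample's uncertainty to its smoothing value is an assumption of this sketch, not necessarily the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def ual_loss(logits, targets, uncertainty, max_smooth=0.2):
    """Cross-entropy with per-sample label smoothing: samples the model is
    more uncertain about get a larger smoothing value.

    logits: (batch, classes); targets: (batch,); uncertainty: (batch,) in [0, 1].
    """
    n_classes = logits.size(-1)
    eps = uncertainty * max_smooth                  # per-sample smoothing
    one_hot = F.one_hot(targets, n_classes).float()
    smooth = (1 - eps).unsqueeze(1) * one_hot + (eps / n_classes).unsqueeze(1)
    return -(smooth * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

loss = ual_loss(torch.randn(4, 10),
                torch.tensor([1, 3, 0, 7]),
                torch.tensor([0.1, 0.9, 0.4, 0.6]))
print(loss)
```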
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection, as our models outperform the baselines; a toy version of the embed-and-detect pipeline follows the citation below.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
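A toy version of the embed-and-detect pipeline mentioned above. A real run would replace the `embed` stub with an actual LLM embedding endpoint; the crude numeric features and the IsolationForest settings here are illustrative assumptions.

```python
import re
import numpy as np
from sklearn.ensemble import IsolationForest

def embed(record: str) -> np.ndarray:
    """Hypothetical stand-in for an LLM embedding call; two crude numeric
    features keep the sketch runnable offline."""
    amount = float(re.search(r"(\d+\.\d+)", record).group(1))
    return np.array([np.log1p(amount), float(len(record))])

records = [f"2024-03-{d:02d} card payment 19.99 EUR" for d in range(1, 30)]
records.append("2024-03-30 wire transfer 250000.00 EUR offshore")

X = np.stack([embed(r) for r in records])
detector = IsolationForest(contamination=0.05, random_state=0).fit(X)
flags = detector.predict(X)  # -1 marks suspected anomalies
print([r for r, f in zip(records, flags) if f == -1])
```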
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability; one common formulation is sketched after the citation below.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
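One common formulation of the IFD score compares the model's perplexity on the answer given the instruction with its perplexity on the answer alone. The sketch below assumes a Hugging Face causal LM; the gpt2 checkpoint is just a placeholder.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_loss(context: str, answer: str) -> float:
    """Mean negative log-likelihood of the answer tokens, optionally
    conditioned on a context prefix."""
    ans_ids = tok(answer, return_tensors="pt").input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, ans_ids], dim=1)
        labels = ids.clone()
        labels[:, : ctx_ids.size(1)] = -100  # score only the answer span
    else:
        ids, labels = ans_ids, ans_ids.clone()
    with torch.no_grad():
        return model(input_ids=ids, labels=labels).loss.item()

def ifd(instruction: str, answer: str) -> float:
    # Perplexity ratio: higher means the instruction helps the model less.
    return math.exp(answer_loss(instruction, answer)) / math.exp(
        answer_loss("", answer)
    )

print(ifd("Translate to French: Good morning.", " Bonjour."))
```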
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features; a simplified slice-finding routine follows the citation below.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
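A simplified stand-in for the slice-detection idea above: group evaluation examples by a feature and flag groups whose accuracy trails the overall accuracy. This illustrates the concept only; it is not the paper's Edisa model.

```python
from collections import defaultdict

def find_weak_slices(examples, min_size=3, margin=0.10):
    """examples: list of (feature_value, is_correct) pairs. Flags feature
    groups whose accuracy falls more than `margin` below overall accuracy."""
    overall = sum(ok for _, ok in examples) / len(examples)
    groups = defaultdict(list)
    for feat, ok in examples:
        groups[feat].append(ok)
    return {
        feat: sum(oks) / len(oks)
        for feat, oks in groups.items()
        if len(oks) >= min_size and sum(oks) / len(oks) < overall - margin
    }

data = ([("short", True)] * 8 + [("short", False)] * 2
        + [("long", True)] * 3 + [("long", False)] * 5)
print(find_weak_slices(data))  # {'long': 0.375}
```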