An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact
- URL: http://arxiv.org/abs/2505.05494v1
- Date: Mon, 05 May 2025 15:17:27 GMT
- Title: An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact
- Authors: Avanija Menon, Ovidiu Serban,
- Abstract summary: The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation.<n>Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection.<n>This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.
Related papers
- Engineering Trustworthy Machine-Learning Operations with Zero-Knowledge Proofs [1.7723990552388873]
Zero-Knowledge Proofs (ZKPs) offer a cryptographic solution that enables provers to demonstrate, through verified computations, adherence to set requirements without revealing sensitive model details or data.<n>We identify five key properties (non-interactivity, transparent setup, standard representations, succinctness, and post-quantum security) critical for their application in AI validation and verification pipelines.
arXiv Detail & Related papers (2025-05-26T15:39:11Z) - Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG)<n>FedE4RAG facilitates collaborative training of client-side RAG retrieval models.<n>We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z) - TVR: Automotive System Requirement Traceability Validation and Recovery Through Retrieval-Augmented Generation [7.50061902435987]
Traceability between stakeholder requirements and system requirements is crucial to ensure consistency, correctness, and regulatory compliance.<n>Existing approaches do not address traceability between stakeholder and system requirements, rely on open-source data, and do not address the validation of manual links established by engineers.<n>We introduce TVR, a requirement Traceability Validation and Recovery approach primarily targeting automotive systems.
arXiv Detail & Related papers (2025-04-21T20:37:23Z) - Lawful and Accountable Personal Data Processing with GDPR-based Access and Usage Control in Distributed Systems [0.0]
This paper proposes a case-generic method for automated normative reasoning that establishes legal arguments for the lawfulness of data processing activities.<n>The arguments are established on the basis of case-specific legal qualifications made by privacy experts, bringing the human in the loop.<n>The resulting system is designed and critically assessed in reference to requirements extracted from the GPDR.
arXiv Detail & Related papers (2025-03-10T10:49:34Z) - Optimizing Product Provenance Verification using Data Valuation Methods [24.59951827145763]
We introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in Stable Isotope Ratio Analysis (SIRA)<n>We validate our methodology with extensive experiments, demonstrating its potential to significantly enhance provenance verification, mitigate fraudulent trade practices, and strengthen regulatory enforcement of global supply chains.
arXiv Detail & Related papers (2025-02-21T03:16:19Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - Climate AI for Corporate Decarbonization Metrics Extraction [7.522638089716454]
We introduce the Climate Artificial Intelligence for Corporate Decarbonization Metrics Extraction (CAI) model and pipeline.
We demonstrate that the process improves data collection efficiency and accuracy by automating data curation, validation, and metric scoring from public corporate disclosures.
arXiv Detail & Related papers (2024-11-05T18:37:51Z) - Enhancing Legal Compliance and Regulation Analysis with Large Language Models [0.0]
This research explores the application of Large Language Models (LLMs) to accurately classify legal provisions and automate compliance checks.
Our findings demonstrate promising results, indicating LLMs' significant potential to enhance legal compliance and regulatory analysis efficiency, notably by reducing manual workload and improving accuracy within reasonable time financial constraints.
arXiv Detail & Related papers (2024-04-26T16:40:49Z) - Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
Unclear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z) - Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer [60.31021888394358]
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR)
We propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data.
arXiv Detail & Related papers (2023-03-31T03:14:44Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
In real world industry use Machine Learning to build models on user data.
Such mandates require effort both in terms of data as well as model retraining.
continuous removal of data and model retraining steps do not scale.
We propose textitMachine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Provably Efficient Causal Reinforcement Learning with Confounded
Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.