Advanced Unstructured Data Processing for ESG Reports: A Methodology for
Structured Transformation and Enhanced Analysis
- URL: http://arxiv.org/abs/2401.02992v1
- Date: Thu, 4 Jan 2024 06:26:59 GMT
- Authors: Jiahui Peng, Jing Gao, Xin Tong, Jing Guo, Hang Yang, Jianchuan Qi,
Ruiqiao Li, Nan Li, Ming Xu
- Abstract summary: This study introduces an innovative methodology to transform ESG reports into structured, analyzable formats.
Our approach offers high-precision text cleaning, adept identification and extraction of text from images, and standardization of tables within these reports.
This research marks a substantial contribution to the fields of industrial ecology and corporate sustainability assessment.
- Score: 20.038120319271773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the evolving field of corporate sustainability, analyzing unstructured
Environmental, Social, and Governance (ESG) reports is a complex challenge due
to their varied formats and intricate content. This study introduces an
innovative methodology utilizing the "Unstructured Core Library", specifically
tailored to address these challenges by transforming ESG reports into
structured, analyzable formats. Our approach significantly advances the
existing research by offering high-precision text cleaning, adept
identification and extraction of text from images, and standardization of
tables within these reports. Emphasizing its capability to handle diverse data
types, including text, images, and tables, the method adeptly manages the
nuances of differing page layouts and report styles across industries. This
research marks a substantial contribution to the fields of industrial ecology
and corporate sustainability assessment, paving the way for the application of
advanced NLP technologies and large language models in the analysis of
corporate governance and sustainability. Our code is available at
https://github.com/linancn/TianGong-AI-Unstructure.git.
Related papers
- Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across various ML domains using consistent notation, which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z)
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" (DAT) is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performance across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
- A Survey on Retrieval-Augmented Text Generation for Large Language Models [1.4579344926652844]
Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements.
This paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation.
It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies.
arXiv Detail & Related papers (2024-04-17T01:27:42Z)
- Visual Analytics for Fine-grained Text Classification Models and Datasets [3.6873612681664016]
SemLa is a novel visual analytics system tailored for fine-grained text classification.
This paper details the iterative design study and the resulting innovations featured in SemLa.
arXiv Detail & Related papers (2024-03-21T17:26:28Z)
- Generative AI in the Construction Industry: A State-of-the-art Analysis [0.4241054493737716]
There is a gap in the literature on the current state, opportunities, and challenges of generative AI in the construction industry.
This study aims to review and categorize the existing and emerging generative AI opportunities and challenges in the construction industry.
It proposes a framework for construction firms to build customized generative AI solutions using their own data.
arXiv Detail & Related papers (2024-02-15T13:39:55Z)
- Contextualization Distillation from Large Language Model for Knowledge
Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced
Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- Incremental hierarchical text clustering methods: a review [49.32130498861987]
This study aims to analyze various hierarchical and incremental clustering techniques.
The main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that addressed text document clustering.
arXiv Detail & Related papers (2023-12-12T22:27:29Z)
- Cognitive Computing to Optimize IT Services [0.0]
A cognitive solution goes beyond traditional structured data analysis through deep analysis of both structured and unstructured text.
In experiments, the proposed approach reduced yearly ticket volume by up to 18-25%.
arXiv Detail & Related papers (2021-12-28T09:56:44Z)
- A Dependency Syntactic Knowledge Augmented Interactive Architecture for
End-to-End Aspect-based Sentiment Analysis [73.74885246830611]
We propose a novel dependency syntactic knowledge augmented interactive architecture with multi-task learning for end-to-end ABSA.
This model is capable of fully exploiting the syntactic knowledge (dependency relations and types) by leveraging a well-designed Dependency Relation Embedded Graph Convolutional Network (DreGcn).
Extensive experimental results on three benchmark datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-04T14:59:32Z)