A Scalable Framework for Table of Contents Extraction from Complex ESG
Annual Reports
- URL: http://arxiv.org/abs/2310.18073v1
- Date: Fri, 27 Oct 2023 11:40:32 GMT
- Title: A Scalable Framework for Table of Contents Extraction from Complex ESG
Annual Reports
- Authors: Xinyu Wang, Lin Gui, Yulan He
- Abstract summary: We propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022.
These reports pose significant challenges due to their diverse structures and extensive length.
We propose a new framework for Toc extraction, consisting of three steps.
- Score: 19.669390380593843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Table of contents (ToC) extraction centres on structuring documents in a
hierarchical manner. In this paper, we propose a new dataset, ESGDoc,
comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to
2022. These reports pose significant challenges due to their diverse structures
and extensive length. To address these challenges, we propose a new framework
for Toc extraction, consisting of three steps: (1) Constructing an initial tree
of text blocks based on reading order and font sizes; (2) Modelling each tree
node (or text block) independently by considering its contextual information
captured in node-centric subtree; (3) Modifying the original tree by taking
appropriate action on each tree node (Keep, Delete, or Move). This
construction-modelling-modification (CMM) process offers several benefits. It
eliminates the need for pairwise modelling of section headings as in previous
approaches, making document segmentation practically feasible. By incorporating
structured information, each section heading can leverage both local and
long-distance context relevant to itself. Experimental results show that our
approach outperforms the previous state-of-the-art baseline with a fraction of
running time. Our framework proves its scalability by effectively handling
documents of any length.
Related papers
- Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification [20.434941308959786]
Long document classification presents challenges due to their extensive content and complex structure.
Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents.
Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts.
arXiv Detail & Related papers (2024-10-03T19:25:01Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of document within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z) - UniVIE: A Unified Label Space Approach to Visual Information Extraction
from Form-like Documents [11.761942458294136]
We present a new perspective, reframing VIE as a relation prediction problem and unifying labels of different tasks into a single label space.
This unified approach allows for the definition of various relation types and effectively tackles hierarchical relationships in form-like documents.
We present UniVIE, a unified model that addresses the VIE problem comprehensively.
arXiv Detail & Related papers (2024-01-17T14:02:36Z) - Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text
Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We propose TAT-DQA, i.e. to answer the question over a visually-rich table-text document.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z) - Multimodal Tree Decoder for Table of Contents Extraction in Document
Images [32.46909366312659]
Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents.
We first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels.
We propose a novel end-to-end model by using the multimodal tree decoder (MTD) for ToC as a benchmark for HierDoc.
arXiv Detail & Related papers (2022-12-06T11:38:31Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by
Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical
Supervision from Extractive Summaries [46.183289748907804]
We propose SOE, a pipelined system that outlines, outlining and elaborating for long text generation.
SOE produces long texts with significantly better quality, along with faster convergence speed.
arXiv Detail & Related papers (2020-10-14T13:22:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.