Jointly Learning Span Extraction and Sequence Labeling for Information
Extraction from Business Documents
- URL: http://arxiv.org/abs/2205.13434v1
- Date: Thu, 26 May 2022 15:37:24 GMT
- Authors: Nguyen Hong Son, Hieu M. Vu, Tuan-Anh D. Nguyen, Minh-Tien Nguyen
- Abstract summary: This paper introduces a new information extraction model for business documents.
It takes advantage of both span extraction and sequence labeling.
The model is trained end-to-end to jointly optimize the two tasks.
- Score: 1.6249267147413522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a new information extraction model for business
documents. Unlike prior studies, which rely on either span extraction or
sequence labeling alone, the model takes advantage of both formulations. The
combination allows the model to handle long documents with sparse information
(i.e., only a small amount of information to be extracted). The model is
trained end-to-end to jointly optimize the two tasks
in a unified manner. Experimental results on four business datasets in English
and Japanese show that the model achieves promising results and is
significantly faster than the normal span-based extraction method. The code is
also available.
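The abstract describes a shared model jointly optimized for span extraction and sequence labeling. A minimal sketch of what such a joint objective can look like is below; the function names, the toy logits, and the weighting `alpha` are illustrative assumptions, not the authors' implementation (which would use a neural encoder to produce the logits):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-likelihood of the gold index under a softmax.
    return -math.log(softmax(logits)[target_index])

def joint_loss(start_logits, end_logits, tag_logits,
               gold_span, gold_tags, alpha=0.5):
    """Weighted sum of a span-extraction loss (gold start/end positions)
    and a per-token sequence-labeling (e.g. BIO tagging) loss."""
    span_loss = (cross_entropy(start_logits, gold_span[0])
                 + cross_entropy(end_logits, gold_span[1]))
    tag_loss = sum(cross_entropy(l, t)
                   for l, t in zip(tag_logits, gold_tags)) / len(gold_tags)
    return alpha * span_loss + (1.0 - alpha) * tag_loss

# Toy example: a 4-token input, gold span (1, 2), gold BIO tags O B I O
# encoded as 0 1 2 0. In a real model the logits come from a shared encoder.
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.2, 0.1, 1.5, -0.5]
tag_logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1],
              [0.1, 0.1, 2.0], [2.0, 0.1, 0.1]]
loss = joint_loss(start_logits, end_logits, tag_logits, (1, 2), [0, 1, 2, 0])
```

Because both heads share one encoder pass, a single forward computation serves both tasks, which is consistent with the claimed speedup over a purely span-based method.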
Related papers
- From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization [0.19107347888374507]
HunSum-2 is an open-source Hungarian corpus suitable for training abstractive and extractive summarization models.
The dataset is assembled from segments of the Common Crawl corpus, which undergo thorough cleaning.
arXiv Detail & Related papers (2024-04-04T16:07:06Z)
- Do the Benefits of Joint Models for Relation Extraction Extend to Document-level Tasks? [5.8309706367176295]
Two distinct approaches have been proposed for relational triple extraction.
Joint models, which capture interactions across triples, are the more recent development.
We benchmark state-of-the-art pipeline and joint extraction models on sentence-level and document-level datasets.
arXiv Detail & Related papers (2023-10-01T15:09:36Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
arXiv Detail & Related papers (2021-06-09T16:12:21Z)
- A Span Extraction Approach for Information Extraction on Visually-Rich Documents [2.3131309703965135]
We present a new approach to improve the capability of language model pre-training on visually-rich documents (VRDs).
Firstly, we introduce a new IE model that is query-based and employs the span extraction formulation instead of the commonly used sequence labelling approach.
We also propose a new training task which focuses on modelling the relationships between semantic entities within a document.
arXiv Detail & Related papers (2021-06-02T06:50:04Z)
- Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
- IMoJIE: Iterative Memory-Based Joint Open Information Extraction [37.487044478970965]
We present IMoJIE, an extension to CopyAttention, which produces the next extraction conditioned on all previous extractions.
IMoJIE outperforms CopyAttention by about 18 F1 pts, and a BERT-based strong baseline by 2 F1 pts.
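The iterative conditioning described above can be sketched as a simple loop; `decode_next` here is a hypothetical stand-in for IMoJIE's actual seq2seq decoder, and the toy candidates are illustrative, not from the paper:

```python
def iterative_extract(sentence, decode_next, max_steps=10):
    extractions = []
    for _ in range(max_steps):
        # The decoder sees the sentence plus everything extracted so far,
        # which lets it avoid emitting redundant triples.
        triple = decode_next(sentence, extractions)
        if triple is None:  # decoder signals it is done
            break
        extractions.append(triple)
    return extractions

# Toy stand-in decoder over precomputed candidate triples.
CANDIDATES = [("Alice", "founded", "Acme"), ("Acme", "is based in", "Oslo")]

def toy_decoder(sentence, extracted_so_far):
    for candidate in CANDIDATES:
        if candidate not in extracted_so_far:
            return candidate
    return None

triples = iterative_extract("Alice founded Acme, which is based in Oslo.",
                            toy_decoder)
```

The key design point is that each call receives the growing `extractions` list, so later extractions are conditioned on earlier ones rather than being decoded independently.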
arXiv Detail & Related papers (2020-05-17T07:04:08Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.