Jointly Learning Span Extraction and Sequence Labeling for Information
Extraction from Business Documents
- URL: http://arxiv.org/abs/2205.13434v1
- Date: Thu, 26 May 2022 15:37:24 GMT
- Title: Jointly Learning Span Extraction and Sequence Labeling for Information
Extraction from Business Documents
- Authors: Nguyen Hong Son, Hieu M. Vu, Tuan-Anh D. Nguyen, Minh-Tien Nguyen
- Abstract summary: This paper introduces a new information extraction model for business documents.
It takes advantage of both span extraction and sequence labeling.
The model is trained end-to-end to jointly optimize the two tasks.
- Score: 1.6249267147413522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a new information extraction model for business
documents. Unlike prior studies based only on span extraction or sequence
labeling, the model takes advantage of both. The combination allows the model to
deal with long documents containing sparse information (i.e., only a small
amount of information needs to be extracted). The model is trained end-to-end to
jointly optimize the two tasks in a unified manner. Experimental results on four
business datasets in English and Japanese show that the model achieves promising
results and is significantly faster than a standard span-based extraction
method. The code is also available.
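As a rough illustration of the kind of architecture the abstract describes, the sketch below pairs a span-extraction head (start/end logits) with a BIO sequence-labeling head on top of a shared encoder and sums both losses for end-to-end training. This is not the authors' code: the encoder, dimensions, tag set, and toy batch are all assumptions (the paper most likely uses a pretrained language-model encoder; a small BiLSTM stands in here so the example runs without downloads).

```python
# Minimal sketch of jointly learning span extraction and sequence labeling.
# All names and sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class JointExtractor(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        enc_dim = hidden * 2
        self.tag_head = nn.Linear(enc_dim, num_tags)   # sequence labeling: BIO tag per token
        self.span_head = nn.Linear(enc_dim, 2)         # span extraction: start/end logits per token

    def forward(self, input_ids):
        h, _ = self.encoder(self.embed(input_ids))     # (B, T, enc_dim)
        tag_logits = self.tag_head(h)                  # (B, T, num_tags)
        start_logits, end_logits = self.span_head(h).split(1, dim=-1)
        return tag_logits, start_logits.squeeze(-1), end_logits.squeeze(-1)

# One joint end-to-end training step: both task losses are optimized together.
model = JointExtractor()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

input_ids  = torch.randint(0, 30522, (2, 128))   # toy batch of token ids
bio_tags   = torch.randint(0, 9, (2, 128))       # gold BIO tags per token
span_start = torch.tensor([5, 17])               # gold span boundaries per example
span_end   = torch.tensor([9, 20])

tag_logits, start_logits, end_logits = model(input_ids)
loss = (ce(tag_logits.reshape(-1, 9), bio_tags.reshape(-1))
        + ce(start_logits, span_start)
        + ce(end_logits, span_end))
opt.zero_grad()
loss.backward()
opt.step()
```

The design choice the abstract points to is that the token-level tagging head copes well with long, information-sparse documents, while the span head gives precise boundaries; summing the losses lets one encoder serve both.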
Related papers
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks (a minimal retrieval sketch appears after this list).
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions [4.961045761391367]
Reading comprehension models answer questions posed in natural language when provided with a short passage of text.
We introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality.
We demonstrate on two datasets that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large-scale datasets.
arXiv Detail & Related papers (2024-08-17T15:16:54Z)
- From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization [0.19107347888374507]
HunSum-2 is an open-source Hungarian corpus suitable for training abstractive and extractive summarization models.
The dataset is assembled from segments of the Common Crawl corpus that undergo thorough cleaning.
arXiv Detail & Related papers (2024-04-04T16:07:06Z)
- Do the Benefits of Joint Models for Relation Extraction Extend to Document-level Tasks? [5.8309706367176295]
Two distinct approaches have been proposed for relational triple extraction.
Joint models, which capture interactions across triples, are the more recent development.
We benchmark state-of-the-art pipeline and joint extraction models on sentence-level and document-level datasets.
arXiv Detail & Related papers (2023-10-01T15:09:36Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
arXiv Detail & Related papers (2021-06-09T16:12:21Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
- IMoJIE: Iterative Memory-Based Joint Open Information Extraction [37.487044478970965]
We present IMoJIE, an extension to CopyAttention, which produces the next extraction conditioned on all previously extracted tuples (see the iterative-decoding sketch after this list).
IMoJIE outperforms CopyAttention by about 18 F1 points and a strong BERT-based baseline by 2 F1 points.
arXiv Detail & Related papers (2020-05-17T07:04:08Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
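Referenced from the CRAFT entry above: a minimal sketch of similarity-based document retrieval for seeding a task-specific synthetic dataset. This is not the authors' code; TF-IDF cosine similarity and the toy corpus are stand-in assumptions (CRAFT itself may use a different similarity model over a web-scale corpus).

```python
# Retrieve the corpus documents most similar to a few task examples,
# as candidates to be turned into synthetic training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

few_shot_examples = ["Extract the invoice total from the following receipt ..."]
web_corpus = [
    "Receipt: groceries, total due 42.17 EUR ...",
    "Recipe for sourdough bread ...",
    "Invoice #1042, amount payable: $318.00 ...",
]

vectorizer = TfidfVectorizer().fit(web_corpus + few_shot_examples)
corpus_vecs = vectorizer.transform(web_corpus)
query_vecs = vectorizer.transform(few_shot_examples)

# Rank corpus documents by their best similarity to any task example, keep top k.
scores = cosine_similarity(query_vecs, corpus_vecs).max(axis=0)
top_k = scores.argsort()[::-1][:2]
retrieved = [web_corpus[i] for i in top_k]
print(retrieved)  # human-written documents to augment into training examples
```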
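Referenced from the IMoJIE entry above: a minimal sketch of iterative extraction in which each step is conditioned on the sentence plus all tuples extracted so far. This is not the authors' code; the off-the-shelf, untuned t5-small model is only a placeholder for the trained extractor, and the separator string and stopping rule are assumptions.

```python
# Iterative, memory-based extraction loop: previously extracted tuples are
# appended to the input so the next extraction is conditioned on them.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

sentence = "Barack Obama was born in Hawaii and served as US president."
extractions = []

for _ in range(3):  # bounded number of extraction steps
    # Condition on the sentence AND everything extracted so far.
    context = sentence + " [prev] " + " ; ".join(extractions)
    inputs = tokenizer(context, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    tuple_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if not tuple_text or tuple_text in extractions:  # crude stopping criterion
        break
    extractions.append(tuple_text)

print(extractions)  # placeholder outputs; a trained extractor would emit real tuples
```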