Information Extraction from Documents: Question Answering vs Token
Classification in real-world setups
- URL: http://arxiv.org/abs/2304.10994v1
- Date: Fri, 21 Apr 2023 14:43:42 GMT
- Title: Information Extraction from Documents: Question Answering vs Token
Classification in real-world setups
- Authors: Laurent Lam, Pirashanth Ratnamogan, Jo\"el Tang, William Vanhuffel and
Fabien Caspani
- Abstract summary: We compare the Question Answering approach with the classical token classification approach for document key information extraction.
Our research showed that when dealing with clean and relatively short entities, it is still best to use token classification-based approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in Document Intelligence and especially in Document Key Information
Extraction (DocKIE) has been mainly solved as Token Classification problem.
Recent breakthroughs in both natural language processing (NLP) and computer
vision helped building document-focused pre-training methods, leveraging a
multimodal understanding of the document text, layout and image modalities.
However, these breakthroughs also led to the emergence of a new DocKIE subtask
of extractive document Question Answering (DocQA), as part of the Machine
Reading Comprehension (MRC) research field. In this work, we compare the
Question Answering approach with the classical token classification approach
for document key information extraction. We designed experiments to benchmark
five different experimental setups : raw performances, robustness to noisy
environment, capacity to extract long entities, fine-tuning speed on Few-Shot
Learning and finally Zero-Shot Learning. Our research showed that when dealing
with clean and relatively short entities, it is still best to use token
classification-based approach, while the QA approach could be a good
alternative for noisy environment or long entities use-cases.
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - Attention Where It Matters: Rethinking Visual Document Understanding
with Selective Region Concentration [26.408343160223517]
We propose a novel end-to-end document understanding model called SeRum.
SeRum converts image understanding and recognition tasks into a local decoding process of the visual tokens of interest.
We show that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks.
arXiv Detail & Related papers (2023-09-03T10:14:34Z) - Information Extraction in Domain and Generic Documents: Findings from
Heuristic-based and Data-driven Approaches [0.0]
Information extraction plays important role in natural language processing.
Document genre and length influence on IE tasks.
No single method demonstrated overwhelming performance in both tasks.
arXiv Detail & Related papers (2023-06-30T20:43:27Z) - Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z) - Generate rather than Retrieve: Large Language Models are Strong Context
Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - Questions Are All You Need to Train a Dense Passage Retriever [123.13872383489172]
ART is a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data.
It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question.
arXiv Detail & Related papers (2022-06-21T18:16:31Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - OPAD: An Optimized Policy-based Active Learning Framework for Document
Content Analysis [6.159771892460152]
We propose textitOPAD, a novel framework using reinforcement policy for active learning in content detection tasks for documents.
The framework learns the acquisition function to decide the samples to be selected while optimizing performance metrics.
We show superior performance of the proposed textitOPAD framework for active learning for various tasks related to document understanding.
arXiv Detail & Related papers (2021-10-01T07:40:56Z) - Timestamping Documents and Beliefs [1.4467794332678539]
Document dating is a challenging problem which requires inference over the temporal structure of the document.
In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach.
We also propose AD3: Attentive Deep Document Dater, an attention-based document dating system.
arXiv Detail & Related papers (2021-06-09T02:12:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.