Learning from similarity and information extraction from structured
documents
- URL: http://arxiv.org/abs/2011.07964v2
- Date: Sat, 13 Mar 2021 21:36:56 GMT
- Title: Learning from similarity and information extraction from structured
documents
- Authors: Martin Holeček
- Abstract summary: The aim is to improve the micro F1 of per-word classification on a huge real-world document dataset.
Results confirm that all proposed architecture parts are required to beat the previous results.
The best model improves on the previous state-of-the-art results by a gain of 8.25 in F1 score.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automation of document processing is gaining recent attention due to the
great potential to reduce manual work through improved methods and hardware.
Neural networks have been successfully applied before - even though they have
been trained only on relatively small datasets with hundreds of documents so
far. To successfully explore deep learning techniques and improve the
information extraction results, a dataset with more than twenty-five thousand
documents has been compiled, anonymized and is published as a part of this
work. We will expand our previous work where we proved that convolutions, graph
convolutions and self-attention can work together and exploit all the
information present in a structured document. Taking the fully trainable method
one step further, we will now design and examine various approaches to using
siamese networks, concepts of similarity, one-shot learning and context/memory
awareness. The aim is to improve micro F1 of per-word classification on the
huge real-world document dataset. The results verify the hypothesis that
trainable access to a similar (yet still different) page together with its
already known target information improves the information extraction.
Furthermore, the experiments confirm that all proposed architecture parts are
required to beat the previous results. The best model improves on the previous
state-of-the-art results by a gain of 8.25 in F1 score. Qualitative analysis is
provided to verify that the new model performs better for all target classes.
Additionally, multiple structural observations about the causes of the
underperformance of some architectures are revealed. All the source code,
parameters and implementation details are published together with the dataset
in the hope of pushing the research boundaries, since none of the techniques
used in this work are problem-specific and all can be generalized to other
tasks and contexts.
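The metric named in the abstract, micro F1 over per-word classification, can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `micro_f1`, the single-label-per-word setup, and the background label `"O"` are assumptions made for illustration only. Micro averaging pools true positives, false positives and false negatives over all words and classes before computing precision and recall.

```python
def micro_f1(gold, pred, ignore=("O",)):
    """Micro-averaged F1 over per-word labels.

    gold, pred: equal-length sequences of class labels, one per word.
    ignore: labels treated as background (hypothetical convention here).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p not in ignore:
            if p == g:
                tp += 1  # predicted a target class and it was correct
            else:
                fp += 1  # predicted a target class, but wrongly
        if g not in ignore and p != g:
            fn += 1      # a gold target class was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels for five words on an invoice page.
gold = ["O", "total", "date", "O", "total"]
pred = ["O", "total", "O", "date", "date"]
score = micro_f1(gold, pred)
```

In this toy example the pooled counts are one true positive, two false positives and two false negatives, so precision, recall and micro F1 all equal 1/3.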
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Unveiling Document Structures with YOLOv5 Layout Detection [0.0]
This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data.
The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data.
arXiv Detail & Related papers (2023-09-29T07:45:10Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert handpicks documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant.
To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting a new state of the art on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval [51.823187647843945]
In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model.
Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones.
arXiv Detail & Related papers (2021-05-27T11:29:03Z)
- Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks [12.221478896815292]
We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD).
SLLD is a shared subtask of several information extraction problems.
Our results show good improvement with fine-tuning of a pre-trained base network.
arXiv Detail & Related papers (2020-10-18T23:50:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.