Doc2Dict: Information Extraction as Text Generation
- URL: http://arxiv.org/abs/2105.07510v1
- Date: Sun, 16 May 2021 20:46:29 GMT
- Title: Doc2Dict: Information Extraction as Text Generation
- Authors: Benjamin Townsend, Eamon Ito-Fisher, Lily Zhang and Madison May
- Abstract summary: Doc2Dict is an end-to-end approach to document-level information extraction.
We train a language model on existing database records to directly generate structured JSON.
We use gradient checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typically, information extraction (IE) requires a pipeline approach: first, a
sequence labeling model is trained on manually annotated documents to extract
relevant spans; then, when a new document arrives, a model predicts spans which
are then post-processed and standardized to convert the information into a
database entry. We replace this labor-intensive workflow with a transformer
language model trained on existing database records to directly generate
structured JSON. Our solution removes the workload associated with producing
token-level annotations and takes advantage of a data source which is generally
quite plentiful (e.g. database records). As long documents are common in
information extraction tasks, we use gradient checkpointing and chunked
encoding to apply our method to sequences of up to 32,000 tokens on a single
GPU. Our Doc2Dict approach is competitive with more complex, hand-engineered
pipelines and offers a simple but effective baseline for document-level
information extraction. We release our Doc2Dict model and code to reproduce our
experiments and facilitate future work.
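To make the approach concrete, here is a minimal sketch of the training and inference loop the abstract implies. The T5 backbone (via the HuggingFace transformers library) and the JSON serialization details are assumptions; the authors' released code is the authoritative implementation.

```python
# Minimal sketch of "IE as text generation": the model maps raw
# document text directly to a JSON-serialized database record, so no
# token-level span annotations are required.
# Assumptions: a T5 backbone and HuggingFace `transformers`.
import json
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def training_step(document: str, record: dict) -> torch.Tensor:
    """One supervised step: document text in, JSON record out."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    # The existing database record itself is the target sequence.
    labels = tokenizer(json.dumps(record), return_tensors="pt").input_ids
    return model(**inputs, labels=labels).loss

def extract(document: str) -> dict:
    """Inference: generate a JSON string and parse it into a dict."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return json.loads(text)  # may raise if the model emits malformed JSON
```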
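The abstract credits gradient checkpointing and chunked encoding for fitting 32,000-token inputs on one GPU. Below is a hedged sketch of one way to realize chunked encoding with an encoder-decoder model: encode fixed-size windows independently and let the decoder cross-attend over the concatenated encoder states. The chunk size and concatenation scheme are assumptions, not the authors' exact mechanics.

```python
# Sketch of chunked encoding for long documents. Each window is
# encoded independently (self-attention stays local to a chunk), so
# encoder memory grows roughly linearly with document length.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.gradient_checkpointing_enable()  # trade compute for memory in training

CHUNK = 512  # assumed window size

def encode_long(document: str) -> BaseModelOutput:
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    chunks = [ids[i:i + CHUNK] for i in range(0, len(ids), CHUNK)]
    states = [model.encoder(input_ids=c.unsqueeze(0)).last_hidden_state
              for c in chunks]
    # The decoder cross-attends over all chunks at once.
    return BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))

def generate_from_long(document: str, max_length: int = 256) -> str:
    out = model.generate(encoder_outputs=encode_long(document),
                         max_length=max_length)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```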
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
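As an illustration of the nearest-neighbor step, a sketch that greedily chains documents by embedding similarity; the embedding model and the greedy heuristic are assumptions, not the paper's actual ordering algorithm.

```python
# Illustrative grouping of related documents into one pretraining
# sequence via embedding similarity. Encoder choice and the greedy
# chaining heuristic are assumptions for this sketch.
from sentence_transformers import SentenceTransformer

def chain_related(docs: list[str]) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    emb = model.encode(docs, normalize_embeddings=True)
    order, remaining = [0], set(range(1, len(docs)))
    while remaining:
        # Greedily append the nearest unvisited neighbor of the last doc.
        last = emb[order[-1]]
        nxt = max(remaining, key=lambda i: float(last @ emb[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return [docs[i] for i in order]
```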
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
- Plug-and-Play Document Modules for Pre-trained Models [92.9897146991974]
We propose to represent each document as a plug-and-play document module, i.e., a document plugin, for PTMs (PlugD).
By inserting document plugins into the backbone PTM for downstream tasks, we can encode a document one time to handle multiple tasks.
Experiments on 8 datasets of 4 typical NLP tasks show that PlugD enables models to encode documents once and for all across different scenarios.
arXiv Detail & Related papers (2023-05-28T08:01:40Z)
- CED: Catalog Extraction from Documents [12.037861186708799]
We propose a transition-based framework for parsing documents into catalog trees.
We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents.
arXiv Detail & Related papers (2023-04-28T07:32:00Z)
- DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop [0.0]
DoSA (Document Specific Automated Annotations) helps annotators in generating initial annotations automatically using our novel bootstrap approach.
An open-source ready-to-use implementation is made available on GitHub.
arXiv Detail & Related papers (2022-11-09T15:04:07Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model that handles different document formats within a single architecture.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
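A minimal sketch of that two-step loop; `query_llm` is a hypothetical placeholder for an LLM client, not a real API.

```python
# Hedged sketch of generate-then-read. `query_llm` is a hypothetical
# stand-in for whatever LLM API is available.
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def generate_then_read(question: str) -> str:
    # Step 1: generate a contextual document instead of retrieving one.
    context = query_llm(f"Generate a background document to answer: {question}")
    # Step 2: read the generated document to produce the final answer.
    return query_llm(f"Document: {context}\nQuestion: {question}\nAnswer:")
```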
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- A sequence-to-sequence approach for document-level relation extraction [4.906513405712846]
Document-level relation extraction (DocRE) requires integrating information within and across sentences.
Seq2rel can learn the subtasks of DocRE end-to-end, replacing a pipeline of task-specific components.
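To make the seq2seq framing concrete, a sketch of linearizing relation triples into a single target string; the marker tokens are invented for illustration and are not seq2rel's exact scheme.

```python
# Illustrative linearization of document-level relation triples into a
# generation target. The @...@ markers are made up for this sketch.
def linearize(triples: list[tuple[str, str, str]]) -> str:
    return " ".join(f"{head} @HEAD@ {tail} @TAIL@ {rel} @EOR@"
                    for head, rel, tail in triples)

# linearize([("aspirin", "treats", "headache")])
# -> "aspirin @HEAD@ headache @TAIL@ treats @EOR@"
```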
arXiv Detail & Related papers (2022-04-03T16:03:19Z)
- Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents [4.581762147208636]
We evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents.
We finetune models that jointly extract the information and generate the output already in a structured format.
We propose a novel method to align the output with the input text, thus facilitating system inspection and auditing.
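The alignment step can be approximated with fuzzy substring matching; the sketch below uses Python's standard library and does not reproduce the paper's proposed method.

```python
# Hedged sketch: map a generated field value back to a character span
# in the source document for inspection and auditing.
from difflib import SequenceMatcher

def align(value: str, document: str) -> tuple[int, int] | None:
    matcher = SequenceMatcher(None, document, value, autojunk=False)
    match = matcher.find_longest_match(0, len(document), 0, len(value))
    if match.size == 0:
        return None  # nothing in the document resembles the value
    return (match.a, match.a + match.size)
```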
arXiv Detail & Related papers (2022-01-14T20:20:12Z)
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
arXiv Detail & Related papers (2021-06-09T16:12:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.