Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents
- URL: http://arxiv.org/abs/2201.05658v1
- Date: Fri, 14 Jan 2022 20:20:12 GMT
- Title: Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents
- Authors: Ramon Pires and Fábio C. de Souza and Guilherme Rosa and Roberto A. Lotufo and Rodrigo Nogueira
- Abstract summary: We evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents.
We finetune models that jointly extract the information and generate the output already in a structured format.
We propose a novel method to align the output with the input text, thus facilitating system inspection and auditing.
- Score: 4.581762147208636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical information extraction pipeline consists of token- or span-level
classification models coupled with a series of pre- and post-processing
scripts. In a production pipeline, requirements often change, with classes
being added and removed, which leads to nontrivial modifications to the source
code and the possible introduction of bugs. In this work, we evaluate
sequence-to-sequence models as an alternative to token-level classification
methods for information extraction of legal and registration documents. We
finetune models that jointly extract the information and generate the output
already in a structured format. Post-processing steps are learned during
training, thus eliminating the need for rule-based methods and simplifying the
pipeline. Furthermore, we propose a novel method to align the output with the
input text, thus facilitating system inspection and auditing. Our experiments
on four real-world datasets show that the proposed method is an alternative to
classical pipelines.
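
As a rough illustration of the approach, here is a minimal sketch of seq2seq extraction that emits a structured string directly, assuming a T5-style model from the Hugging Face transformers library; the "extract:" prefix, the field names, and the "key: value" output format are illustrative assumptions, not the authors' exact schema.

```python
# Minimal sketch: seq2seq extraction that emits structured output directly.
# Field names and the "key: value" target format are illustrative only.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

document = "Sale registered on 2020-03-01 between Alice Ltd and Bob SA for $1,000."
target = "date: 2020-03-01; seller: Alice Ltd; buyer: Bob SA; amount: $1,000"

# One finetuning step: the model learns to generate the structured string,
# so post-processing rules become learned behavior instead of code.
inputs = tokenizer("extract: " + document, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
loss.backward()
```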
Related papers
- Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging [33.522495018321386]
We introduce a cutting-edge Log parsing framework with Entropy sampling and Chain-of-Thought Merging (Lemur).
We propose a novel sampling method inspired by information entropy, which efficiently clusters typical logs.
Lemur achieves state-of-the-art performance and impressive efficiency.
arXiv Detail & Related papers (2024-02-28T09:51:55Z)
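
A minimal sketch of the entropy-sampling idea described above: rank log lines by token-level information entropy and keep the most informative ones as cluster representatives. The entropy measure and the toy logs are assumptions, not Lemur's actual algorithm.

```python
# Sketch: rank log lines by token entropy; high-entropy lines carry more
# template variety. A rough reading of the idea, not Lemur's algorithm.
import math
from collections import Counter

def token_entropy(line: str) -> float:
    tokens = line.split()
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

logs = [
    "connection from 10.0.0.1 closed",
    "connection from 10.0.0.2 closed",
    "ERROR disk /dev/sda1 full code=28 retry=3",
]
# Keep the most informative line as a cluster representative.
print(max(logs, key=token_entropy))
```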
- Zero-Shot Text Matching for Automated Auditing using Sentence Transformers [0.3078691410268859]
We study the efficiency of unsupervised text matching using Sentence-BERT, a transformer-based model, by applying it to the semantic similarity of financial passages.
Experimental results show that this model is robust to documents from in- and out-of-domain data.
arXiv Detail & Related papers (2022-10-28T11:52:16Z)
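
A minimal sketch of zero-shot matching with Sentence-BERT via the sentence-transformers library; the model name and the financial passages are assumptions for illustration.

```python
# Sketch: zero-shot semantic matching with Sentence-BERT. The model name
# and the financial passages are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "Revenue is recognized when control of the goods transfers."
passages = [
    "Sales are booked once the buyer obtains control of the product.",
    "Employee travel expenses require prior written approval.",
]
# Higher cosine similarity = closer semantic match.
print(util.cos_sim(model.encode(query), model.encode(passages)))
```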
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
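
A minimal sketch of the linearization idea: turn a relation matrix into one deterministic, ordered target string that a generative model can learn. The entities and relation labels are invented; this is not DORE's exact serialization.

```python
# Sketch: linearize a relation matrix into one deterministic, ordered
# target string. Entities and relation labels are invented for illustration.
entities = ["Alice Ltd", "Bob SA", "2020-03-01"]
relations = {(0, 1): "sold_to", (0, 2): "registered_on"}

# Fixed row-major order makes the generation target unique and learnable.
target = " ".join(
    f"({entities[i]}; {rel}; {entities[j]})"
    for (i, j), rel in sorted(relations.items())
)
print(target)
```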
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
arXiv Detail & Related papers (2022-07-29T20:39:02Z)
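
A minimal sketch of single-pass, tag-per-word inverse text normalization, where each input word receives exactly one tag so predictions stay aligned with the input. The tag values are toy assumptions, not the Thutmose tag vocabulary.

```python
# Sketch: one tag per input word keeps predictions aligned with the input.
# The tag values below are toy assumptions, not the Thutmose tag set.
def apply_tags(words: list[str], tags: list[str]) -> str:
    out = [w if t == "<self>" else t for w, t in zip(words, tags)]
    return " ".join(t for t in out if t)  # empty tags merge into a neighbor

words = ["pay", "twenty", "three", "dollars"]
tags = ["<self>", "$23", "", ""]  # second word carries the rewrite
print(apply_tags(words, tags))  # -> "pay $23"
```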
- Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts approaches on perplexity (PPL) and on the sentiment accuracy of generated texts, as measured by an external classifier.
At the same time, it is easier to implement and tune, and has significantly fewer restrictions and requirements.
arXiv Detail & Related papers (2022-05-15T12:58:35Z)
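
A minimal sketch of classifier-guided decoding: blend the language model's next-token logits with an attribute classifier's scores before picking a token. The mixing rule and the random tensors are stand-ins, not the paper's exact formulation.

```python
# Sketch: blend LM next-token logits with attribute-classifier scores.
# The mixing rule and random tensors are stand-ins for illustration.
import torch

def guided_logits(lm_logits: torch.Tensor,
                  attr_logprobs: torch.Tensor,
                  alpha: float = 2.0) -> torch.Tensor:
    # attr_logprobs[i] ~ log P(attribute | continuation with token i)
    return lm_logits + alpha * attr_logprobs

vocab = 8
lm_logits = torch.randn(vocab)
attr_logprobs = torch.log_softmax(torch.randn(vocab), dim=-1)
print(int(torch.argmax(guided_logits(lm_logits, attr_logprobs))))
```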
- Question-Based Salient Span Selection for More Controllable Text Summarization [67.68208237480646]
We propose a method for incorporating question-answering (QA) signals into a summarization model.
Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs.
This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary.
arXiv Detail & Related papers (2021-11-15T17:36:41Z)
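
A minimal sketch of the first stage described above: wrap salient noun phrases in marker tokens so a downstream summarizer can condition on them. The markers and the NP list are illustrative assumptions.

```python
# Sketch of stage one: wrap salient noun phrases in marker tokens so the
# generator can condition on them. Markers and the NP list are assumptions.
def mark_salient(text: str, salient_nps: list[str]) -> str:
    for np in salient_nps:
        text = text.replace(np, f"<sal> {np} </sal>")
    return text

doc = "The commission approved the merger after a six-month review."
print(mark_salient(doc, ["the merger"]))
# Stage two would feed the marked document to a summarization model.
```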
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
arXiv Detail & Related papers (2021-07-28T18:09:46Z)
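
A minimal sketch of prompt-based classification: fill a template with each candidate verbalizer and pick the one the language model scores as most probable. GPT-2, the template, and the verbalizers ("great"/"terrible") are assumptions for illustration.

```python
# Sketch: prompt-based classification by LM scoring. GPT-2, the template,
# and the verbalizers ("great"/"terrible") are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss    # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

review = "The contract review was quick and painless."
scores = {v: logprob(f"{review} Overall it was {v}.")
          for v in ("great", "terrible")}
print(max(scores, key=scores.get))
```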
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP-based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
arXiv Detail & Related papers (2021-06-09T16:12:21Z)
- Doc2Dict: Information Extraction as Text Generation [0.0]
Doc2Dict is a pipeline for extracting document-level information.
We train a language model on existing database records to generate structured spans.
We use checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU.
arXiv Detail & Related papers (2021-05-16T20:46:29Z)
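
A minimal sketch of chunked encoding for long inputs: split a tokenized document into fixed-size chunks that can be encoded independently. This simplifies the long-input trick mentioned above and is not Doc2Dict's implementation.

```python
# Sketch: split a long tokenized document into fixed-size chunks that can
# be encoded independently. A simplification, not Doc2Dict's implementation.
def chunk(token_ids: list[int], size: int = 512) -> list[list[int]]:
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]

token_ids = list(range(32_000))  # stand-in for a tokenized document
chunks = chunk(token_ids)
print(len(chunks), "chunks of up to 512 tokens")
# Each chunk would be encoded separately (with gradient checkpointing to
# bound memory) before the decoder emits the structured record.
```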
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts.
Our solution also incorporates metadata explicitly rather than just augmenting the text with it.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
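
A minimal sketch of explicit metadata incorporation: serialize metadata fields with markers and prepend them to the text before encoding. The marker scheme is an assumption about how such fusion might look, not the paper's method.

```python
# Sketch: serialize metadata with field markers and prepend it to the text.
# The marker scheme is an assumption, not the paper's method.
def build_input(text: str, metadata: dict[str, str]) -> str:
    meta = " ".join(f"[{k}] {v}" for k, v in sorted(metadata.items()))
    return f"{meta} [text] {text}"

print(build_input("Quarterly results beat expectations.",
                  {"source": "press-release", "year": "2020"}))
```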
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
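
A minimal sketch of full-text plausibility ranking, using the same LM-scoring idea as the prompting sketch above: score each candidate ending by the language-model likelihood of the complete sentence and keep the best one. GPT-2 and the example sentences are stand-ins, not the paper's scoring method.

```python
# Sketch: rank candidate endings by full-sentence LM likelihood. GPT-2 and
# the example sentences are stand-ins, not the paper's scoring method.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

def logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss      # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

premise = "The notary stamped the deed, so the transfer became"
print(max(["official.", "purple."], key=lambda c: logprob(f"{premise} {c}")))
```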
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.