Sequence-to-Sequence Models for Extracting Information from Registration
  and Legal Documents
- URL: http://arxiv.org/abs/2201.05658v1
- Date: Fri, 14 Jan 2022 20:20:12 GMT
- Title: Sequence-to-Sequence Models for Extracting Information from Registration
  and Legal Documents
- Authors: Ramon Pires and Fábio C. de Souza and Guilherme Rosa and Roberto A.
  Lotufo and Rodrigo Nogueira
- Abstract summary: We evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction of legal and registration documents.
We finetune models that jointly extract the information and generate the output already in a structured format.
We propose a novel method to align the output with the input text, thus facilitating system inspection and auditing.
- Score: 4.581762147208636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical information extraction pipeline consists of token- or span-level
classification models coupled with a series of pre- and post-processing
scripts. In a production pipeline, requirements often change, with classes
being added and removed, which leads to nontrivial modifications to the source
code and the possible introduction of bugs. In this work, we evaluate
sequence-to-sequence models as an alternative to token-level classification
methods for information extraction of legal and registration documents. We
finetune models that jointly extract the information and generate the output
already in a structured format. Post-processing steps are learned during
training, thus eliminating the need for rule-based methods and simplifying the
pipeline. Furthermore, we propose a novel method to align the output with the
input text, thus facilitating system inspection and auditing. Our experiments
on four real-world datasets show that the proposed method is an alternative to
classical pipelines.
 
      
Related papers
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
 Generative recommendation systems are driven by large language models (LLMs).
Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens.
We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
 arXiv  Detail & Related papers  (2024-12-22T21:56:15Z)
- Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging [33.522495018321386]
 We introduce a cutting-edge Log parsing framework with Entropy sampling and Chain-of-Thought Merging (Lemur).
We propose a novel sampling method inspired by information entropy, which efficiently clusters typical logs.
Lemur achieves state-of-the-art performance and impressive efficiency.
 arXiv  Detail & Related papers  (2024-02-28T09:51:55Z)
- Zero-Shot Text Matching for Automated Auditing using Sentence
  Transformers [0.3078691410268859]
 We study the efficiency of unsupervised text matching using Sentence-BERT, a transformer-based model, by applying it to the semantic similarity of financial passages.
 Experimental results show that this model is robust to documents from in- and out-of-domain data.
 arXiv  Detail & Related papers  (2022-10-28T11:52:16Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
 This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
 Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
 arXiv  Detail & Related papers  (2022-10-28T11:18:10Z)
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization [76.87664008338317]
 Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
 arXiv  Detail & Related papers  (2022-07-29T20:39:02Z)
- Classifiers are Better Experts for Controllable Text Generation [63.17266060165098]
 We show that the proposed method significantly outperforms recent PPLM, GeDi, and DExperts on PPL and sentiment accuracy based on the external classifier of generated texts.
At the same time, it is also easier to implement and tune, and has significantly fewer restrictions and requirements.
 arXiv  Detail & Related papers  (2022-05-15T12:58:35Z)
- Question-Based Salient Span Selection for More Controllable Text
  Summarization [67.68208237480646]
 We propose a method for incorporating question-answering (QA) signals into a summarization model.
Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs.
This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary.
 arXiv  Detail & Related papers  (2021-11-15T17:36:41Z)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods
  in Natural Language Processing [78.8500633981247]
 This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
 arXiv  Detail & Related papers  (2021-07-28T18:09:46Z)
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
 This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
 arXiv  Detail & Related papers  (2021-06-09T16:12:21Z)
- Doc2Dict: Information Extraction as Text Generation [0.0]
 Doc2Dict is a pipeline for extracting document-level information.
We train a language model on existing database records to generate structured spans.
We use checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single baseline.
 arXiv  Detail & Related papers  (2021-05-16T20:46:29Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
 We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
 arXiv  Detail & Related papers  (2020-10-23T21:52:38Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense
  Reasoning [61.32992639292889]
 Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
 arXiv  Detail & Related papers  (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     