Context-Aware Classification of Legal Document Pages
- URL: http://arxiv.org/abs/2304.02787v2
- Date: Tue, 25 Apr 2023 14:59:24 GMT
- Title: Context-Aware Classification of Legal Document Pages
- Authors: Pavlos Fragkogiannis, Martina Forster, Grace E. Lee, Dell Zhang
- Abstract summary: We present a simple but effective approach that overcomes the constraint on input length.
Specifically, we enhance the input with extra tokens carrying sequential information about previous pages.
Our experiments conducted on two legal datasets in English and Portuguese respectively show that the proposed approach can significantly improve the performance of document page classification.
- Score: 7.306025535482021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For many business applications that require the processing, indexing, and
retrieval of professional documents such as legal briefs (in PDF format etc.),
it is often essential to classify the pages of any given document into their
corresponding types beforehand. Most existing studies in the field of document
image classification either focus on single-page documents or treat multiple
pages in a document independently. Although in recent years a few techniques
have been proposed to exploit the context information from neighboring pages to
enhance document page classification, they typically cannot be utilized with
large pre-trained language models due to the constraint on input length. In
this paper, we present a simple but effective approach that overcomes the above
limitation. Specifically, we enhance the input with extra tokens carrying
sequential information about previous pages - introducing recurrence - which
enables the use of pre-trained Transformer models like BERT for context-aware
page classification. Our experiments conducted on two legal datasets in English
and Portuguese respectively show that the proposed approach can significantly
improve the performance of document page classification compared to the
non-recurrent setup as well as the other context-aware baselines.
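The recurrence described above can be pictured with a short sketch. The code below is a minimal illustration, not the authors' released implementation: the model name, the projection layer, and the classifier head are assumptions. Each page is encoded with BERT, and a learned projection of the previous page's [CLS] vector is spliced in as one extra input token, giving the encoder access to sequential context from earlier pages.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RecurrentPageClassifier(nn.Module):
    """Toy context-aware page classifier: one extra input token carries
    the previous page's representation into the current page's encoding."""

    def __init__(self, num_labels: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # Projects the previous page's [CLS] vector into one extra token.
        self.context_proj = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, prev_context=None):
        # Look up word embeddings directly so an extra token can be spliced
        # in; BERT still adds positional embeddings on top of inputs_embeds.
        embeds = self.bert.embeddings.word_embeddings(input_ids)
        if prev_context is not None:
            ctx = self.context_proj(prev_context).unsqueeze(1)  # (B, 1, H)
            # Insert the context token right after [CLS]; keep page text to
            # at most 511 tokens so the sequence still fits in 512 positions.
            embeds = torch.cat([embeds[:, :1], ctx, embeds[:, 1:]], dim=1)
            extra = attention_mask.new_ones(attention_mask.size(0), 1)
            attention_mask = torch.cat([extra, attention_mask], dim=1)
        out = self.bert(inputs_embeds=embeds, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] stays at position 0
        return self.classifier(cls), cls   # logits + context for the next page
```

Pages are then classified in reading order, feeding each page's returned [CLS] vector back in as prev_context for the following page; detaching that vector between pages avoids backpropagating through the whole document.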
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
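A hedged sketch of the first of those two ideas, the contrastive objective that counts a document's neighbors among its negatives; the function name, tensor shapes, and temperature are illustrative assumptions rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def neighbor_aware_contrastive_loss(q, d_pos, d_neighbors, temperature=0.05):
    """InfoNCE-style loss where each query is contrasted against its positive
    document plus K neighboring documents, so the encoder must separate a
    document from its own context.
    q: (B, H) queries; d_pos: (B, H) positives; d_neighbors: (B, K, H)."""
    q = F.normalize(q, dim=-1)
    d_pos = F.normalize(d_pos, dim=-1)
    d_neighbors = F.normalize(d_neighbors, dim=-1)
    pos = (q * d_pos).sum(-1, keepdim=True)           # (B, 1) positive scores
    neg = torch.einsum("bh,bkh->bk", q, d_neighbors)  # (B, K) neighbor scores
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)  # positive sits at index 0
```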
- μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context [26.820913216377903]
This work focuses on Regesta Pontificum Romanum, a large collection of papal registers.
Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents.
arXiv Detail & Related papers (2024-08-28T09:01:18Z)
- GRAM: Global Reasoning for Multi-Page VQA [14.980413646626234]
We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with designated document-level layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
arXiv Detail & Related papers (2024-01-07T08:03:06Z)
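A rough sketch of that recipe under stated assumptions (the hidden size, the number of document tokens, and a single global layer are all illustrative): per-page summary vectors from a single-page encoder are concatenated with learnable document tokens and mixed by a document-level transformer layer.

```python
import torch
import torch.nn as nn

class ToyGlobalPageLayer(nn.Module):
    """Toy document-level reasoning over per-page encodings: learnable
    document tokens plus one transformer layer attending across pages."""

    def __init__(self, hidden: int = 768, num_doc_tokens: int = 8, nhead: int = 8):
        super().__init__()
        self.doc_tokens = nn.Parameter(torch.randn(1, num_doc_tokens, hidden))
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=nhead, batch_first=True
        )

    def forward(self, page_vecs):
        # page_vecs: (batch, num_pages, hidden), e.g. per-page [CLS] vectors
        # produced by a frozen single-page encoder.
        tokens = self.doc_tokens.expand(page_vecs.size(0), -1, -1)
        return self.layer(torch.cat([tokens, page_vecs], dim=1))
```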
- In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents.
We introduce approximate algorithms for finding related documents with efficient nearest neighbor search.
We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
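One plausible realization of the "related documents in sequence" step, sketched under assumptions (the embedding source, neighbor count, and greedy chaining heuristic are illustrative, not the paper's exact algorithm): embed every document, build a nearest-neighbor index, and chain each document to its closest unvisited neighbor to form pretraining sequences.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def chain_related_documents(doc_embeddings):
    """Greedily order documents so that neighbors in embedding space end up
    adjacent in the pretraining stream. doc_embeddings: (N, H) array."""
    n = len(doc_embeddings)
    knn = NearestNeighbors(n_neighbors=min(16, n)).fit(doc_embeddings)
    _, idx = knn.kneighbors(doc_embeddings)
    visited, order, current = set(), [], 0
    while len(order) < n:
        visited.add(current)
        order.append(current)
        # Follow the nearest unvisited neighbor; fall back to any unvisited doc.
        nxt = next((int(j) for j in idx[current] if int(j) not in visited), None)
        if nxt is None:
            nxt = next((j for j in range(n) if j not in visited), None)
        if nxt is None:
            break
        current = nxt
    return order
```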
- Beyond Document Page Classification: Design, Datasets, and Challenges [32.94494070330065]
This paper highlights the need to bring document classification benchmarking closer to real-world applications.
We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations.
arXiv Detail & Related papers (2023-08-24T16:16:47Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning [5.109216329453963]
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
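SPECTER's citation-informed pretraining signal is a triplet margin loss: a paper's embedding should lie closer to a paper it cites than to a paper it does not. A minimal sketch, with the margin value and batch handling as assumptions:

```python
import torch
import torch.nn.functional as F

def citation_triplet_loss(query, cited, uncited, margin: float = 1.0):
    """Pull a paper's embedding toward a cited paper (positive) and away
    from an uncited one (negative). All inputs: (B, H) embeddings, e.g.
    the Transformer's [CLS] outputs for title plus abstract."""
    d_pos = torch.norm(query - cited, dim=-1)    # distance to cited paper
    d_neg = torch.norm(query - uncited, dim=-1)  # distance to uncited paper
    return F.relu(d_pos - d_neg + margin).mean()
```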