Sequence-aware multimodal page classification of Brazilian legal
documents
- URL: http://arxiv.org/abs/2207.00748v1
- Date: Sat, 2 Jul 2022 06:23:25 GMT
- Title: Sequence-aware multimodal page classification of Brazilian legal
documents
- Authors: Pedro H. Luz de Araujo, Ana Paula G. S. de Almeida, Fabricio A. Braz,
Nilton C. da Silva, Flavio de Barros Vidal, Teofilo E. de Campos
- Abstract summary: We train and evaluate our methods on a novel multimodal dataset of 6,510 lawsuits.
Each lawsuit is an ordered sequence of pages, which are stored both as an image and as a corresponding text.
We train unimodal image and text classifiers and use them as extractors of visual and textual features, which are then combined through our proposed Fusion Module.
- Score: 0.21204495827342434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Brazilian Supreme Court receives tens of thousands of cases each
semester. Court employees spend thousands of hours to execute the initial
analysis and classification of those cases -- which takes effort away from
posterior, more complex stages of the case management workflow. In this paper,
we explore multimodal classification of documents from Brazil's Supreme Court.
We train and evaluate our methods on a novel multimodal dataset of 6,510
lawsuits (339,478 pages) with manual annotation assigning each page to one of
six classes. Each lawsuit is an ordered sequence of pages, which are stored
both as an image and as a corresponding text extracted through optical
character recognition. We first train two unimodal classifiers: a ResNet
pre-trained on ImageNet is fine-tuned on the images, and a convolutional
network with filters of multiple kernel sizes is trained from scratch on
document texts. We use them as extractors of visual and textual features, which
are then combined through our proposed Fusion Module. Our Fusion Module can
handle missing textual or visual input by using learned embeddings for missing
data. Moreover, we experiment with bi-directional Long Short-Term Memory
(biLSTM) networks and linear-chain conditional random fields to model the
sequential nature of the pages. The multimodal approaches outperform both
textual and visual classifiers, especially when leveraging the sequential
nature of the pages.
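For concreteness, the sketch below shows one way the pipeline described in the abstract could be wired together in PyTorch: a ResNet backbone extracts per-page visual features, a convolutional network with multiple kernel sizes extracts textual features, a fusion module substitutes learned embeddings for missing modalities, and a biLSTM models the page sequence of a lawsuit (the linear-chain CRF variant is omitted). All dimensions, kernel sizes, and the choice of ResNet-50 are illustrative assumptions; this is not the authors' released implementation.

```python
# Minimal sketch of the described pipeline (assumed hyper-parameters, not the authors' code).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 6   # six page classes, as in the dataset
TEXT_EMB = 128    # assumed word-embedding size
FUSED_DIM = 512   # assumed fused feature size


class TextCNN(nn.Module):
    """Convolutional text classifier with filters of multiple kernel sizes."""
    def __init__(self, vocab_size, kernel_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, TEXT_EMB, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(TEXT_EMB, n_filters, k) for k in kernel_sizes]
        )
        self.out_dim = n_filters * len(kernel_sizes)

    def forward(self, token_ids):                  # (pages, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (pages, emb, seq_len)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(feats, dim=1)             # (pages, out_dim)


class FusionModule(nn.Module):
    """Combines visual and textual features; a learned vector replaces a
    missing modality instead of zero-padding it."""
    def __init__(self, img_dim, txt_dim):
        super().__init__()
        self.missing_img = nn.Parameter(torch.randn(img_dim))
        self.missing_txt = nn.Parameter(torch.randn(txt_dim))
        self.proj = nn.Sequential(nn.Linear(img_dim + txt_dim, FUSED_DIM), nn.ReLU())

    def forward(self, img_feat, txt_feat, img_mask, txt_mask):
        # masks: (pages,) booleans, True where that modality is available
        n = img_mask.shape[0]
        img_feat = torch.where(img_mask[:, None], img_feat, self.missing_img.expand(n, -1))
        txt_feat = torch.where(txt_mask[:, None], txt_feat, self.missing_txt.expand(n, -1))
        return self.proj(torch.cat([img_feat, txt_feat], dim=1))


class SequencePageClassifier(nn.Module):
    """biLSTM over the fused per-page features of one lawsuit."""
    def __init__(self, vocab_size):
        super().__init__()
        resnet = models.resnet50(weights=None)  # ImageNet weights would be loaded in practice
        resnet.fc = nn.Identity()               # keep the 2048-d pooled features
        self.visual = resnet
        self.textual = TextCNN(vocab_size)
        self.fusion = FusionModule(2048, self.textual.out_dim)
        self.bilstm = nn.LSTM(FUSED_DIM, 256, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 256, NUM_CLASSES)

    def forward(self, images, token_ids, img_mask, txt_mask):
        # images: (pages, 3, H, W); token_ids: (pages, seq_len); masks: (pages,)
        fused = self.fusion(self.visual(images), self.textual(token_ids),
                            img_mask, txt_mask)      # (pages, FUSED_DIM)
        seq_out, _ = self.bilstm(fused.unsqueeze(0))  # one lawsuit = one sequence
        return self.classifier(seq_out.squeeze(0))    # (pages, NUM_CLASSES)
```

In this reading, the page-level feature extractors can be trained first as standalone classifiers and then frozen or fine-tuned while the fusion and sequence layers learn to exploit the ordering of pages within a lawsuit.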
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification
Classifying long documents that contain hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships among image features and section- and sentence-level text features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- Multi-modal Generation via Cross-Modal In-Context Learning
We propose a Multi-modal Generation via Cross-Modal In-Context Learning (MGCC) method that generates novel images from complex multimodal prompt sequences.
Our MGCC demonstrates a diverse range of multimodal capabilities, like novel image generation, the facilitation of multimodal dialogue, and generation of texts.
arXiv Detail & Related papers (2024-05-28T15:58:31Z)
- MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation
MMTryon is a multi-modal multi-reference VIrtual Try-ON framework.
It can generate high-quality compositional try-on results by taking a text instruction and multiple garment images as inputs.
arXiv Detail & Related papers (2024-05-01T11:04:22Z)
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Emu: Generative Pretraining in Multimodality
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- SelfDoc: Self-Supervised Document Representation Learning
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning
We introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks.
We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion.
arXiv Detail & Related papers (2020-09-30T05:39:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.