Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout
Analysis
- URL: http://arxiv.org/abs/2308.10511v2
- Date: Tue, 22 Aug 2023 14:08:20 GMT
- Title: Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout
Analysis
- Authors: Shrestha Datta and Md Adith Mollah and Raisa Fairooz and Tariful Islam
Fahim
- Abstract summary: In the DL Sprint 2.0 competition, we worked on understanding Bangla documents.
We used a dataset called BaDLAD with lots of examples.
We trained a special model called Mask R-CNN to help with this understanding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding digital documents is like solving a puzzle, especially
historical ones. Document Layout Analysis (DLA) helps with this puzzle by
dividing documents into sections like paragraphs, images, and tables. This is
crucial for machines to read and understand these documents. In the DL Sprint
2.0 competition, we worked on understanding Bangla documents. We used a dataset
called BaDLAD with lots of examples. We trained a special model called Mask
R-CNN to help with this understanding. We made this model better by
step-by-step hyperparameter tuning, and we achieved a good dice score of 0.889.
However, not everything went perfectly. We tried using a model trained for
English documents, but it didn't fit well with Bangla. This showed us that each
language has its own challenges. Our solution for the DL Sprint 2.0 is publicly
available at https://www.kaggle.com/competitions/dlsprint2/discussion/432201
along with notebooks, model weights, and an inference notebook.
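The abstract describes step-by-step hyperparameter tuning, i.e. fixing the best value of one hyperparameter before moving to the next. A minimal generic sketch of that search strategy is below; the function names, the toy objective, and the grid values are illustrative assumptions, not taken from the paper (a real run would train Mask R-CNN and return the validation dice score):

```python
def tune_stepwise(objective, grid):
    """Coordinate-wise search: tune one hyperparameter at a time,
    keeping the best value found so far for all the others."""
    best = {name: values[0] for name, values in grid.items()}
    for name, values in grid.items():
        # Evaluate every candidate value for this one hyperparameter
        scores = {v: objective({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)
    return best

# Stand-in objective (hypothetical): peaks at lr=0.02, batch_size=8.
def toy_objective(params):
    return -abs(params["lr"] - 0.02) - abs(params["batch_size"] - 8) * 0.001

grid = {"lr": [0.0025, 0.01, 0.02], "batch_size": [2, 4, 8]}
print(tune_stepwise(toy_objective, grid))  # → {'lr': 0.02, 'batch_size': 8}
```

This explores only the sum of the grid sizes rather than their product, which is why it is cheaper than a full grid search, at the cost of possibly missing interactions between hyperparameters.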
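The dice score reported above is the standard overlap metric for segmentation masks: twice the intersection divided by the total mask area. A minimal sketch of its computation on flattened binary masks (the competition's actual evaluation code lives in the linked notebooks and may differ in detail):

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat 0/1 sequences:
    2 * |A intersect B| / (|A| + |B|)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Convention: two empty masks agree perfectly
    return 2.0 * inter / total if total else 1.0

# Two toy masks overlapping in one pixel: 2*1 / (2+2) = 0.5
print(dice_score([1, 1, 0], [0, 1, 1]))  # → 0.5
```

A score of 0.889 therefore means the predicted and ground-truth layout masks overlap heavily but not perfectly.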
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding, which requires processing and comprehending large amounts of textual and visual information, is a challenging task.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- LiLiuM: eBay's Large Language Models for e-commerce [6.819297537500464]
We introduce the LiLiuM series of large language models (LLMs): 1B, 7B, and 13B parameter models developed 100% in-house.
This gives eBay full control over all aspects of the models including license, data, vocabulary, and architecture.
The LiLiuM LLMs have been trained on 3 trillion tokens of multilingual text from general and e-commerce domains.
arXiv Detail & Related papers (2024-06-17T18:45:41Z)
- Bengali Document Layout Analysis with Detectron2 [0.0]
Document layout analysis involves segmenting documents into meaningful units like text boxes, paragraphs, images, and tables.
We improved the accuracy of the DLA model for Bengali documents by utilizing advanced Mask R-CNN models available in the Detectron2 library.
Results show the effectiveness of these models in accurately segmenting Bengali documents.
arXiv Detail & Related papers (2023-08-26T05:29:09Z)
- Framework and Model Analysis on Bengali Document Layout Analysis Dataset: BaDLAD [0.7925493098304448]
This study focuses on understanding Bengali Document Layouts using advanced computer programs: Detectron2, YOLOv8, and SAM.
By comparing their accuracy and speed, we learned which model is best suited to different types of documents.
arXiv Detail & Related papers (2023-08-15T07:52:24Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- VTLayout: Fusion of Visual and Text Features for Document Layout Analysis [5.836306027133707]
Document layout analysis (DLA) has the potential to capture rich information in historical or scientific documents on a large scale.
This paper proposes a VT model fusing the documents' deep visual, shallow visual, and text features to identify category blocks.
The identification capability of the VT is superior to the most advanced method of DLA based on the PubLayNet dataset, and the F1 score is as high as 0.9599.
arXiv Detail & Related papers (2021-08-12T17:12:11Z)
- MexPub: Deep Transfer Learning for Metadata Extraction from German Publications [1.1549572298362785]
We present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image.
Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents.
arXiv Detail & Related papers (2021-06-04T09:43:48Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout, and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.