Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout
Analysis
- URL: http://arxiv.org/abs/2308.10511v2
- Date: Tue, 22 Aug 2023 14:08:20 GMT
- Title: Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout
Analysis
- Authors: Shrestha Datta and Md Adith Mollah and Raisa Fairooz and Tariful Islam
Fahim
- Abstract summary: In the DL Sprint 2.0 competition, we worked on understanding Bangla documents.
We used a dataset called BaDLAD with lots of examples.
We trained a special model called Mask R-CNN to help with this understanding.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding digital documents is like solving a puzzle, especially
historical ones. Document Layout Analysis (DLA) helps with this puzzle by
dividing documents into sections like paragraphs, images, and tables. This is
crucial for machines to read and understand these documents. In the DL Sprint
2.0 competition, we worked on understanding Bangla documents. We used a dataset
called BaDLAD with lots of examples. We trained a special model called Mask
R-CNN to help with this understanding. We made this model better by
step-by-step hyperparameter tuning, and we achieved a good dice score of 0.889.
However, not everything went perfectly. We tried using a model trained for
English documents, but it didn't fit well with Bangla. This showed us that each
language has its own challenges. Our solution for the DL Sprint 2.0 is publicly
available at https://www.kaggle.com/competitions/dlsprint2/discussion/432201
along with notebooks, model weights, and an inference notebook.
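The abstract describes step-by-step hyperparameter tuning, i.e. fixing the best value of one hyperparameter before moving to the next. A minimal generic sketch of that search strategy is below; the function names, the toy objective, and the grid values are illustrative assumptions, not taken from the paper (a real run would train Mask R-CNN and return the validation dice score):

```python
def tune_stepwise(objective, grid):
    """Coordinate-wise search: tune one hyperparameter at a time,
    keeping the best value found so far for all the others."""
    best = {name: values[0] for name, values in grid.items()}
    for name, values in grid.items():
        # Evaluate every candidate value for this one hyperparameter
        scores = {v: objective({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)
    return best

# Stand-in objective (hypothetical): peaks at lr=0.02, batch_size=8.
def toy_objective(params):
    return -abs(params["lr"] - 0.02) - abs(params["batch_size"] - 8) * 0.001

grid = {"lr": [0.0025, 0.01, 0.02], "batch_size": [2, 4, 8]}
print(tune_stepwise(toy_objective, grid))  # → {'lr': 0.02, 'batch_size': 8}
```

This explores only the sum of the grid sizes rather than their product, which is why it is cheaper than a full grid search, at the cost of possibly missing interactions between hyperparameters.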
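The dice score reported above is the standard overlap metric for segmentation masks: twice the intersection divided by the total mask area. A minimal sketch of its computation on flattened binary masks (the competition's actual evaluation code lives in the linked notebooks and may differ in detail):

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat 0/1 sequences:
    2 * |A intersect B| / (|A| + |B|)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Convention: two empty masks agree perfectly
    return 2.0 * inter / total if total else 1.0

# Two toy masks overlapping in one pixel: 2*1 / (2+2) = 0.5
print(dice_score([1, 1, 0], [0, 1, 1]))  # → 0.5
```

A score of 0.889 therefore means the predicted and ground-truth layout masks overlap heavily but not perfectly.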
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding, which requires processing and comprehending large amounts of textual and visual information, is a challenging task.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- LiLiuM: eBay's Large Language Models for e-commerce [6.819297537500464]
We introduce the LiLiuM series of large language models (LLMs): 1B, 7B, and 13B parameter models developed 100% in-house.
This gives eBay full control over all aspects of the models including license, data, vocabulary, and architecture.
The LiLiuM LLMs have been trained on 3 trillion tokens of multilingual text from general and e-commerce domains.
arXiv Detail & Related papers (2024-06-17T18:45:41Z)
- Bengali Document Layout Analysis with Detectron2 [0.0]
Document layout analysis involves segmenting documents into meaningful units like text boxes, paragraphs, images, and tables.
We improved the accuracy of the DLA model for Bengali documents by utilizing advanced Mask R-CNN models available in the Detectron2 library.
Results show the effectiveness of these models in accurately segmenting Bengali documents.
arXiv Detail & Related papers (2023-08-26T05:29:09Z)
- Framework and Model Analysis on Bengali Document Layout Analysis Dataset: BaDLAD [0.7925493098304448]
This study focuses on understanding Bengali Document Layouts using advanced computer programs: Detectron2, YOLOv8, and SAM.
By comparing their accuracy and speed, we learned which model is best suited to different types of documents.
arXiv Detail & Related papers (2023-08-15T07:52:24Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow, OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- VTLayout: Fusion of Visual and Text Features for Document Layout Analysis [5.836306027133707]
Document layout analysis (DLA) has the potential to capture rich information in historical or scientific documents on a large scale.
This paper proposes a VT model fusing the documents' deep visual, shallow visual, and text features to identify category blocks.
The identification capability of the VT is superior to the most advanced method of DLA based on the PubLayNet dataset, and the F1 score is as high as 0.9599.
arXiv Detail & Related papers (2021-08-12T17:12:11Z)
- MexPub: Deep Transfer Learning for Metadata Extraction from German Publications [1.1549572298362785]
We present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image.
Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents.
arXiv Detail & Related papers (2021-06-04T09:43:48Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout, and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding [108.12766816023783]
We propose LayoutLM to jointly model interactions between text and layout information across scanned document images.
This is the first time that text and layout are jointly learned in a single framework for document-level pre-training.
It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42).
arXiv Detail & Related papers (2019-12-31T14:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.