Related papers: IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

URL: http://arxiv.org/abs/2512.20236v1
Date: Tue, 23 Dec 2025 10:49:37 GMT
Title: IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing
Authors: Oikantik Nath, Sahithi Kukkala, Mitesh Khapra, Ravi Kiran Sarvadevabhatla
Abstract summary: IndicDLP is a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance. Models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization.
Score: 8.447520506307567
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.

Related papers

Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
BuDDIE: A Business Document Dataset for Multi-task Information Extraction [18.440587946049845]
BuDDIE is the first multi-task dataset of 1,665 real-world business documents. Our dataset consists of publicly available business entity documents from US state government websites.
arXiv Detail & Related papers (2024-04-05T10:26:42Z)
DocLLM: A layout-aware generative language model for multimodal document understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents. Our model focuses exclusively on bounding box information to incorporate the spatial layout structure. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
A Multi-Modal Multilingual Benchmark for Document Image Classification [21.7518357653137]
We introduce two newly curated multilingual datasets, WIKI-DOC and MULTIEUR-DOCLEX. We study popular visually-rich document understanding (Document AI) models in a previously untested setting for document image classification. Experimental results show limitations of multilingual Document AI models on cross-lingual transfer across typologically distant languages.
arXiv Detail & Related papers (2023-10-25T04:35:06Z)
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages. We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis [6.155943751502232]
We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets. Our model is suitable for industrial applications, particularly in multi-language scenarios.
arXiv Detail & Related papers (2023-04-24T03:54:48Z)
MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding [53.03978356918377]
Spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features at either the word level or the region level but fail to consider both simultaneously. We propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time.
arXiv Detail & Related papers (2022-11-27T22:47:37Z)
Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
CDA: a Cost Efficient Content-based Multilingual Web Document Aligner [97.98885151955467]
We introduce a Content-based Document Alignment approach to align multilingual web documents based on content. We leverage lexical translation models to build vector representations using TF-IDF. Experiments show that CDA is robust, cost-effective, and significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.
arXiv Detail & Related papers (2021-02-20T03:37:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.