Related papers: Towards Khmer Scene Document Layout Detection

URL: http://arxiv.org/abs/2603.00707v1
Date: Sat, 28 Feb 2026 15:30:16 GMT
Title: Towards Khmer Scene Document Layout Detection
Authors: Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing, Masakazu Iwamura, Koichi Kise
Abstract summary: We present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions.
Score: 3.5477182055025107
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: While document layout analysis for Latin scripts has advanced significantly, driven by the advent of large multimodal models (LMMs), progress for the Khmer language remains constrained because of the scarcity of annotated training data. This gap is particularly acute for scene documents, where perspective distortions and complex backgrounds challenge traditional methods. Given the structural complexities of Khmer script, such as diacritics and multi-layer character stacking, existing Latin-based layout analysis models fail to accurately delineate semantic layout units, particularly for dense text regions (e.g., list items). In this paper, we present the first comprehensive study on Khmer scene document layout detection. We contribute a novel framework comprising three key elements: (1) a robust training and benchmarking dataset specifically for Khmer scene layouts; (2) an open-source document augmentation tool capable of synthesizing realistic scene documents to scale training data; and (3) layout detection baselines utilizing YOLO-based architectures with oriented bounding boxes (OBB) to handle geometric distortions. To foster further research in the Khmer document analysis and recognition (DAR) community, we release our models, code, and datasets in this gated repository (in review).

Related papers

Structure-Aware Text Recognition for Ancient Greek Critical Editions [16.43811675687955]
This paper investigates structure-aware text recognition for Ancient Greek critical editions. We introduce a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation. We evaluate three state-of-the-art visual language models under both zero-shot and fine-tuning regimes.
arXiv Detail & Related papers (2026-03-03T09:42:43Z)
KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document [11.302542266122579]
Khmer is a language spoken daily by over 17 million people in Cambodia. Lack of dedicated resources is particularly acute for business documents. We present KH-FUNSD, the first publicly available dataset for Khmer form document understanding.
arXiv Detail & Related papers (2025-12-04T13:28:44Z)
SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension [77.93156509994994]
We show how to represent short chunks in a way that is conditioned on a broader context window to enhance retrieval performance. Existing embedding models are not well-equipped to encode such situated context effectively. Our method substantially outperforms state-of-the-art embedding models.
arXiv Detail & Related papers (2025-08-03T23:59:31Z)
Discourse Features Enhance Detection of Document-Level Machine-Generated Content [53.41994768824785]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. We introduce novel methodologies and datasets to overcome these challenges.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark [1.5409800688911346]
We introduce the first Khmer scene-text dataset, featuring 1,544 expert-annotated images. This diverse dataset includes flat text, raised text, poorly illuminated text, distant polygon and partially obscured text.
arXiv Detail & Related papers (2024-10-23T21:04:24Z)
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser [32.62155069664013]
We introduce Multilingual semi-structured XForm PARSER (XForm), which is anchored on a comprehensive pre-trained language model. We also develop InDFormSFT, a dataset that specifically addresses the parsing needs of forms in various industrial contexts.
arXiv Detail & Related papers (2024-05-27T16:37:17Z)
Exploring Large Language Models and Hierarchical Frameworks for Classification of Large Unstructured Legal Documents [0.6349503549199403]
We explore the classification of large legal documents and their lack of structural information with a deep-learning-based hierarchical framework. Specifically, we divide a document into parts to extract their embeddings from the last four layers of a custom fine-tuned Large Language Model. Our approach achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art methods.
arXiv Detail & Related papers (2024-03-11T16:24:08Z)
PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis [6.155943751502232]
We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets. Our model is suitable for industrial applications, particularly in multi-language scenarios.
arXiv Detail & Related papers (2023-04-24T03:54:48Z)
RDU: A Region-based Approach to Form-style Document Understanding [69.29541701576858]
Key Information Extraction (KIE) is aimed at extracting structured information from form-style documents. We develop a new KIE model named Region-based Understanding Document (RDU). RDU takes as input the text content and corresponding coordinates of a document, and tries to predict the result by localizing a bounding-box-like region.
arXiv Detail & Related papers (2022-06-14T14:47:48Z)
Towards End-to-End Unified Scene Text Detection and Layout Analysis [60.68100769639923]
We introduce the task of unified scene text detection and layout analysis. The first hierarchical scene text dataset is introduced to enable this novel research task. We also propose a novel method that is able to simultaneously detect scene text and form text clusters in a unified way.
arXiv Detail & Related papers (2022-03-28T23:35:45Z)
SMDT: Selective Memory-Augmented Neural Document Translation [53.4627288890316]
We propose a Selective Memory-augmented Neural Document Translation model to deal with documents containing a large hypothesis space of context. We retrieve similar bilingual sentence pairs from the training corpus to augment global context. We extend the two-stream attention model with a selective mechanism to capture local context and diverse global contexts.
arXiv Detail & Related papers (2022-01-05T14:23:30Z)
Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context. We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.