Related papers: UnSupDLA: Towards Unsupervised Document Layout Analysis

URL: http://arxiv.org/abs/2406.06236v1
Date: Mon, 10 Jun 2024 13:06:28 GMT
Title: UnSupDLA: Towards Unsupervised Document Layout Analysis
Authors: Talha Uddin Sheikh, Tahira Shehzadi, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal
Abstract summary: A critical but frequently overlooked problem is the scarcity of labeled data needed for layout analysis. We employ a vision-based approach for analyzing document layouts designed to train a network without labels.
Score: 11.574592219976823
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document layout analysis is a key area in document research, involving techniques like text mining and visual analysis. Despite various methods developed to tackle layout analysis, a critical but frequently overlooked problem is the scarcity of labeled data needed for analyses. With the rise of internet use, an overwhelming number of documents are now available online, making the process of accurately labeling them for research purposes increasingly challenging and labor-intensive. Moreover, the diversity of documents online presents a unique set of challenges in maintaining the quality and consistency of these labels, further complicating document layout analysis in the digital era. To address this, we employ a vision-based approach for analyzing document layouts designed to train a network without labels. Instead, we focus on pre-training, initially generating simple object masks from the unlabeled document images. These masks are then used to train a detector, enhancing object detection and segmentation performance. The model's effectiveness is further amplified through several unsupervised training iterations, continuously refining its performance. This approach significantly advances document layout analysis, particularly in precision and efficiency, without labels.

Related papers

DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z)
Structured Attention Matters to Multimodal LLMs in Document Understanding [52.37530640460363]
We investigate how input format influences document comprehension performance. We discover that raw OCR text often impairs rather than improves MLLMs' performance. We propose a novel structure-preserving approach that encodes document elements using the LaTeX paradigm.
arXiv Detail & Related papers (2025-06-19T07:16:18Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities. We propose a novel computer-aided segmentation pipeline to alleviate the burden of time-consuming manual annotation.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications. FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER. We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis [3.231170156689185]
Document AI aims to automatically analyze documents by leveraging natural language processing and computer vision techniques. One of the major tasks of Document AI is document layout analysis, which structures document pages by interpreting the content and spatial relationships of layout, image, and text.
arXiv Detail & Related papers (2023-08-29T16:58:03Z)
SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a well-known problem in the document research community. With internet connectivity growing in personal life, an enormous number of documents have become available in the public domain. We address this challenge using self-supervision, in a way that differs from the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
Efficient few-shot learning for pixel-precise handwritten document layout analysis [11.453393410516991]
We propose an efficient few-shot learning framework for layout analysis. It achieves performance comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset.
arXiv Detail & Related papers (2022-10-27T16:03:52Z)
Metrics reloaded: Recommendations for image analysis validation [59.60445111432934]
Metrics Reloaded is a comprehensive framework guiding researchers in the problem-aware selection of metrics. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics.
arXiv Detail & Related papers (2022-06-03T15:56:51Z)
Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z)
Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms. Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications. By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
Active Learning from Crowd in Document Screening [76.9545252341746]
We focus on building a set of machine learning classifiers that evaluate documents, and then screen them efficiently. We propose a screening-specific sampling technique for multi-label active learning: objective-aware sampling. We demonstrate that objective-aware sampling significantly outperforms state-of-the-art active learning sampling strategies.
arXiv Detail & Related papers (2020-11-11T16:17:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.