TextDoctor: Unified Document Image Inpainting via Patch Pyramid Diffusion Models
- URL: http://arxiv.org/abs/2503.04021v1
- Date: Thu, 06 Mar 2025 02:16:35 GMT
- Title: TextDoctor: Unified Document Image Inpainting via Patch Pyramid Diffusion Models
- Authors: Wanglong Lu, Lingming Su, Jingjing Zheng, Vinícius Veloso de Melo, Farzaneh Shoeleh, John Hawkin, Terrence Tricco, Hanli Zhao, Xianta Jiang
- Abstract summary: We introduce TextDoctor, a novel unified document image inpainting method. Inspired by human reading behavior, TextDoctor restores fundamental text elements from patches. We propose using structure pyramid prediction and patch pyramid diffusion models to handle varying text sizes.
- Score: 3.158608366563426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Digital versions of real-world text documents often suffer from issues like environmental corrosion of the original document, low-quality scanning, or human interference. Existing document restoration and inpainting methods typically struggle with generalizing to unseen document styles and handling high-resolution images. To address these challenges, we introduce TextDoctor, a novel unified document image inpainting method. Inspired by human reading behavior, TextDoctor restores fundamental text elements from patches and then applies diffusion models to entire document images instead of training models on specific document types. To handle varying text sizes and avoid out-of-memory issues, common in high-resolution documents, we propose using structure pyramid prediction and patch pyramid diffusion models. These techniques leverage multiscale inputs and pyramid patches to enhance the quality of inpainting both globally and locally. Extensive qualitative and quantitative experiments on seven public datasets validated that TextDoctor outperforms state-of-the-art methods in restoring various types of high-resolution document images.
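The abstract describes processing high-resolution documents as overlapping patches and fusing the per-patch results, which avoids the out-of-memory issues of running a model on the full image at once. A minimal sketch of that patch-and-fuse idea is below; it is not the authors' implementation — `process_in_patches`, the overlap-averaging fusion, and the identity placeholder for the per-patch model are all illustrative assumptions, and the paper's actual method additionally operates over a pyramid of scales with diffusion models per patch.

```python
import numpy as np

def process_in_patches(img, patch=64, stride=32, fn=lambda p: p):
    """Apply `fn` to overlapping patches of a 2-D image and fuse by averaging.

    `fn` stands in for an expensive per-patch model (e.g. a diffusion
    inpainting step); here it defaults to the identity. Overlapping
    outputs are accumulated and divided by the per-pixel patch count,
    which blends patch seams smoothly.
    """
    h, w = img.shape
    # Patch origins on a regular stride grid, plus extra origins so the
    # bottom/right borders are always covered.
    ys = sorted(set(list(range(0, h - patch + 1, stride)) + [h - patch]))
    xs = sorted(set(list(range(0, w - patch + 1, stride)) + [w - patch]))
    acc = np.zeros((h, w), dtype=float)   # summed patch outputs
    cnt = np.zeros((h, w), dtype=float)   # how many patches covered each pixel
    for y in ys:
        for x in xs:
            out = fn(img[y:y + patch, x:x + patch])
            acc[y:y + patch, x:x + patch] += out
            cnt[y:y + patch, x:x + patch] += 1.0
    return acc / cnt
```

With the identity `fn`, the fused result reproduces the input exactly, which is a quick sanity check that the overlap bookkeeping is correct before swapping in a real model.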
Related papers
- TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.
We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
- Text Image Inpainting via Global Structure-Guided Diffusion Models [22.859984320894135]
Real-world text can be damaged by corrosion issues caused by environmental or human factors.
Current inpainting techniques often fail to adequately address this problem.
We develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution.
arXiv Detail & Related papers (2024-01-26T13:01:28Z)
- ENTED: Enhanced Neural Texture Extraction and Distribution for Reference-based Blind Face Restoration [51.205673783866146]
We present ENTED, a new framework for blind face restoration that aims to restore high-quality and realistic portrait images.
We utilize a texture extraction and distribution framework to transfer high-quality texture features between the degraded input and reference image.
The StyleGAN-like architecture in our framework requires high-quality latent codes to generate realistic images.
arXiv Detail & Related papers (2024-01-13T04:54:59Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification.
We first mask random patches of the background-excluded document images and then reconstruct the missing pixels.
With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
- Deep Unrestricted Document Image Rectification [110.61517455253308]
We present DocTr++, a novel unified framework for document image rectification.
We upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing.
We contribute a real-world test set and metrics applicable for evaluating the rectification quality.
arXiv Detail & Related papers (2023-04-18T08:00:54Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Enhance to Read Better: An Improved Generative Adversarial Network for Handwritten Document Image Enhancement [1.7491858164568674]
We propose an end-to-end architecture based on Generative Adversarial Networks (GANs) to recover degraded documents into a clean and readable form.
To the best of our knowledge, this is the first work to use the text information while binarizing handwritten documents.
We outperform the state of the art in the H-DIBCO 2018 challenge after fine-tuning our pre-trained model with synthetically degraded Latin handwritten images.
arXiv Detail & Related papers (2021-05-26T17:44:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.