MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction
- URL: http://arxiv.org/abs/2512.14114v1
- Date: Tue, 16 Dec 2025 05:54:27 GMT
- Title: MFE-GAN: Efficient GAN-based Framework for Document Image Enhancement and Binarization with Multi-scale Feature Extraction
- Authors: Rui-Yang Ju, KokSheik Wong, Yanlin Jin, Jen-Shiun Chiang,
- Abstract summary: MFE-GAN is an efficient GAN-based framework with multi-scale feature extraction.<n>We present novel generators, discriminators, and loss functions to improve the model's performance.
- Score: 7.031239620427525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document image enhancement and binarization are commonly performed prior to document analysis and recognition tasks for improving the efficiency and accuracy of optical character recognition (OCR) systems. This is because directly recognizing text in degraded documents, particularly in color images, often results in unsatisfactory recognition performance. To address these issues, existing methods train independent generative adversarial networks (GANs) for different color channels to remove shadows and noise, which, in turn, facilitates efficient text information extraction. However, deploying multiple GANs results in long training and inference times. To reduce both training and inference times of document image enhancement and binarization models, we propose MFE-GAN, an efficient GAN-based framework with multi-scale feature extraction (MFE), which incorporates Haar wavelet transformation (HWT) and normalization to process document images before feeding them into GANs for training. In addition, we present novel generators, discriminators, and loss functions to improve the model's performance, and we conduct ablation studies to demonstrate their effectiveness. Experimental results on the Benchmark, Nabuco, and CMATERdb datasets demonstrate that the proposed MFE-GAN significantly reduces the total training and inference times while maintaining comparable performance with respect to state-of-the-art (SOTA) methods. The implementation of this work is available at https://ruiyangju.github.io/MFE-GAN.
Related papers
- Data Factory with Minimal Human Effort Using VLMs [35.30747487237989]
We introduce a training-free pipeline that integrates pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels.<n>This approach eliminates the need for manual annotations and significantly improves downstream tasks.<n>Our results on PASCAL-5i and COCO-20i present promising performance and outperform concurrent work for one-shot semantic segmentation.
arXiv Detail & Related papers (2025-10-07T09:43:24Z) - Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.<n>We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.<n>We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - Efficient GANs for Document Image Binarization Based on DWT and Normalization [7.597556504891501]
generative adversarial networks (GANs) can generate images where shadows and noise are effectively removed, which allow for text information extraction.
This work introduces an efficient GAN method based on the three-stage network architecture that incorporates the Discrete Wavelet Transformation and normalization.
Experimental results show that the proposed method reduces the training time by 10% and the inference time by 26% when compared to the SOTA method.
arXiv Detail & Related papers (2024-07-05T03:19:32Z) - Efficient Visual State Space Model for Image Deblurring [99.54894198086852]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.<n>We propose a simple yet effective visual state space model (EVSSM) for image deblurring.<n>The proposed EVSSM performs favorably against state-of-the-art methods on benchmark datasets and real-world images.
arXiv Detail & Related papers (2024-05-23T09:13:36Z) - E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation [69.72194342962615]
We introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient?
First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch.
Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model.
Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time.
arXiv Detail & Related papers (2024-01-11T18:59:14Z) - DGNet: Dynamic Gradient-Guided Network for Water-Related Optics Image
Enhancement [77.0360085530701]
Underwater image enhancement (UIE) is a challenging task due to the complex degradation caused by underwater environments.
Previous methods often idealize the degradation process, and neglect the impact of medium noise and object motion on the distribution of image features.
Our approach utilizes predicted images to dynamically update pseudo-labels, adding a dynamic gradient to optimize the network's gradient space.
arXiv Detail & Related papers (2023-12-12T06:07:21Z) - X-Transfer: A Transfer Learning-Based Framework for GAN-Generated Fake
Image Detection [33.31312811230408]
misuse of GANs for generating deceptive images, such as face replacement, raises significant security concerns.
This paper introduces a novel GAN-generated image detection algorithm called X-Transfer.
It enhances transfer learning by utilizing two neural networks that employ interleaved parallel gradient transmission.
arXiv Detail & Related papers (2023-10-07T01:23:49Z) - CCDWT-GAN: Generative Adversarial Networks Based on Color Channel Using
Discrete Wavelet Transform for Document Image Binarization [3.0175628677371935]
This paper introduces a novelty method employing generative adversarial networks based on color channel.
The proposed method involves three stages: image preprocessing, image enhancement, and image binarization.
The experimental results demonstrate that CCDWT-GAN achieves a top two performance on multiple benchmark datasets.
arXiv Detail & Related papers (2023-05-27T08:55:56Z) - Multimodal Data Augmentation for Image Captioning using Diffusion Models [12.221685807426264]
We propose a data augmentation method, leveraging a text-to-image model called Stable Diffusion, to expand the training set.
Experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods.
Further improvement regarding the training efficiency and effectiveness can be obtained after intentionally filtering the generated data.
arXiv Detail & Related papers (2023-05-03T01:57:33Z) - Improving GAN Training via Feature Space Shrinkage [69.98365478398593]
We propose AdaptiveMix, which shrinks regions of training data in the image representation space of the discriminator.
Considering it is intractable to directly bound feature space, we propose to construct hard samples and narrow down the feature distance between hard and easy samples.
The evaluation results demonstrate that our AdaptiveMix can facilitate the training of GANs and effectively improve the image quality of generated samples.
arXiv Detail & Related papers (2023-03-02T20:22:24Z) - Joint Deep Learning of Facial Expression Synthesis and Recognition [97.19528464266824]
We propose a novel joint deep learning of facial expression synthesis and recognition method for effective FER.
The proposed method involves a two-stage learning procedure. Firstly, a facial expression synthesis generative adversarial network (FESGAN) is pre-trained to generate facial images with different facial expressions.
In order to alleviate the problem of data bias between the real images and the synthetic images, we propose an intra-class loss with a novel real data-guided back-propagation (RDBP) algorithm.
arXiv Detail & Related papers (2020-02-06T10:56:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.