TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression
- URL: http://arxiv.org/abs/2603.04115v1
- Date: Wed, 04 Mar 2026 14:35:10 GMT
- Title: TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression
- Authors: Bingxin Wang, Yuan Lan, Zhaoyi Sun, Yang Xiang, Jie Sun
- Abstract summary: Region-of-interest bit allocation can prioritize text but degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. We incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Tests on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp).
- Score: 11.661973720343546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ultra-low bitrate image compression faces a critical challenge: preserving small-font scene text while maintaining overall visual quality. Region-of-interest (ROI) bit allocation can prioritize text but often degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. Instead of relying on ROI coding, we incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Our method, TextBoost, operationalizes this idea through three strategic designs: (i) adaptively filtering OCR outputs and rendering them into a guidance map; (ii) integrating this guidance with decoder features in a calibrated manner via an attention-guided fusion block; and (iii) enforcing guidance-consistent reconstruction in text regions with a regularizing loss that promotes natural blending with the scene. Extensive experiments on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp), producing sharper small-font text while preserving global image quality and effectively decoupling text enhancement from global rate-distortion optimization.
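The first design step described in the abstract, filtering OCR outputs and rasterizing them into a guidance map, can be illustrated with a minimal sketch. This is not the authors' code: the data layout (`box`, `text`, `conf` fields) and the confidence threshold `conf_thresh` are assumptions for illustration, and a real renderer would draw the recognized glyphs rather than just marking their regions.

```python
# Minimal sketch of TextBoost's step (i): adaptively filter OCR outputs
# by recognition confidence, then rasterize the kept detections into a
# per-pixel guidance map the decoder could attend to.
import numpy as np

def build_guidance_map(ocr_results, img_h, img_w, conf_thresh=0.5):
    """ocr_results: list of dicts with 'box' = (x0, y0, x1, y1) pixel
    coordinates, 'text', and 'conf' (recognition confidence in [0, 1])."""
    guidance = np.zeros((img_h, img_w), dtype=np.float32)
    for det in ocr_results:
        if det["conf"] < conf_thresh:  # adaptively drop unreliable detections
            continue
        x0, y0, x1, y1 = det["box"]
        # Mark the text region, weighted by confidence. A full renderer
        # would draw the glyphs of det["text"] here so the decoder sees
        # the actual character shapes.
        guidance[y0:y1, x0:x1] = det["conf"]
    return guidance

dets = [{"box": (2, 2, 6, 4), "text": "EXIT", "conf": 0.9},
        {"box": (0, 0, 3, 2), "text": "??", "conf": 0.2}]
gmap = build_guidance_map(dets, img_h=8, img_w=8)
```

In this sketch the low-confidence detection is filtered out, so only the reliable text region contributes to the map; the decoder-side fusion and the regularizing loss of steps (ii) and (iii) would then consume this map.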
Related papers
- TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering [76.53315206999231]
TextPecker is a plug-and-play structural anomaly perceptive RL strategy. It mitigates noisy reward signals and works with any text-to-image generator. It yields average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering.
arXiv Detail & Related papers (2026-02-24T13:40:23Z) - TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment [68.91073792449201]
We propose TextGuider, a training-free method that encourages accurate and complete text appearance. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
arXiv Detail & Related papers (2025-12-10T06:18:30Z) - Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration [36.43437855052787]
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. We propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM). Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
arXiv Detail & Related papers (2025-12-09T18:56:54Z) - DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy [41.781258763025896]
DCText is a training-free visual text generation method that adopts a divide-and-conquer strategy. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each piece to a designated region. Experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality.
arXiv Detail & Related papers (2025-12-01T05:52:55Z) - Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders [14.655107789528673]
We introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders. We propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks. Our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics.
arXiv Detail & Related papers (2025-06-05T05:23:10Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR)
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity [18.469136842357095]
We develop a new text-guided image compression algorithm that achieves both high perceptual and pixel-wise fidelity.
By doing so, we avoid decoding based on text-guided generative models.
Our method can achieve high pixel-level and perceptual quality, with either human- or machine-generated captions.
arXiv Detail & Related papers (2024-03-05T13:15:01Z) - Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z) - Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.