Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
- URL: http://arxiv.org/abs/2512.08922v1
- Date: Tue, 09 Dec 2025 18:56:54 GMT
- Title: Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
- Authors: Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim,
- Abstract summary: Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. We propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM). Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
- Score: 36.43437855052787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
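The iterative VLM/TSM/DiT interplay described in the abstract can be sketched in toy form. All function names, data structures, and the "denoising" dynamics below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
# Toy sketch of UniT's iterative text-guided restoration loop (hypothetical;
# every name and the latent dynamics here are placeholders, not the paper's code).

def vlm_extract_text(degraded, ocr_hint=None):
    """Stand-in for the VLM: produces a textual guidance string.
    When an intermediate OCR prediction is available, it refines the guidance."""
    return ocr_hint if ocr_hint is not None else degraded.get("coarse_text", "")

def tsm_spot_text(latent):
    """Stand-in for the Text Spotting Module: predicts text
    from intermediate diffusion features at the current step."""
    return latent["text"]

def dit_denoise_step(latent, guidance):
    """Stand-in for one DiT denoising step conditioned on text guidance."""
    latent = dict(latent)
    latent["noise"] *= 0.5          # toy dynamics: noise shrinks each step
    if guidance:
        latent["text"] = guidance   # explicit guidance suppresses hallucinated text
    return latent

def unit_restore(degraded, num_steps=4):
    latent = {"noise": 1.0, "text": degraded["coarse_text"]}
    guidance = vlm_extract_text(degraded)              # initial VLM guidance
    for _ in range(num_steps):
        latent = dit_denoise_step(latent, guidance)    # DiT denoising step
        ocr_pred = tsm_spot_text(latent)               # intermediate OCR prediction
        guidance = vlm_extract_text(degraded, ocr_hint=ocr_pred)  # refine guidance
    return latent

restored = unit_restore({"coarse_text": "OPEN"})
print(restored["text"], round(restored["noise"], 4))  # → OPEN 0.0625
```

The point of the sketch is the control flow: guidance is not fixed once at the start but re-estimated after every denoising step from the TSM's intermediate prediction, which is the iterative refinement the abstract credits with reducing hallucinations.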
Related papers
- TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression [11.661973720343546]
Region-of-interest bit allocation can prioritize text but degrades global fidelity, leading to a trade-off between local accuracy and overall image quality. We incorporate auxiliary textual information extracted by OCR and transmitted with negligible overhead, enabling the decoder to leverage this semantic guidance. Tests on TextOCR and ICDAR 2015 demonstrate that TextBoost yields up to 60.6% higher text-recognition F1 at comparable Peak Signal-to-Noise Ratio (PSNR) and bits per pixel (bpp).
arXiv Detail & Related papers (2026-03-04T14:35:10Z) - TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering [76.53315206999231]
TextPecker is a plug-and-play structural anomaly perceptive RL strategy. It mitigates noisy reward signals and works with any text-to-image generator. It yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering.
arXiv Detail & Related papers (2026-02-24T13:40:23Z) - TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment [68.91073792449201]
We propose TextGuider, a training-free method that encourages accurate and complete text appearance. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
arXiv Detail & Related papers (2025-12-10T06:18:30Z) - Text-Aware Image Restoration with Diffusion Models [30.127247716169666]
Text-Aware Image Restoration (TAIR) is a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy.
arXiv Detail & Related papers (2025-06-11T17:59:46Z) - Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios [30.800865323585377]
We introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR). We propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing convergence and improving generalization. Experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.
arXiv Detail & Related papers (2025-03-10T12:16:19Z) - SPIRE: Semantic Prompt-Driven Image Restoration [66.26165625929747]
We develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework.
Our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength.
Our experiments demonstrate the superior restoration performance of SPIRE compared to the state of the arts.
arXiv Detail & Related papers (2023-12-18T17:02:30Z) - Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors. PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images. Experiments indicate that introducing text prompts into SR yields impressive results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z) - Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.