FuguReport

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Authors: Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan
Affiliations: Central South University / University of Oxford / Microsoft
Categories: Method / Model Adaptation / Updating tokenizer without changing token space; Task / Image Generation / Text rendering in image synthesis; Evaluation / Generation Quality / Performance on stroke clarity and character shape
License: CC BY 4.0

Abstract Overview

This paper studies why visual autoregressive (AR) image generators remain weak at rendering text and argues that the tokenizer decoder is a central bottleneck because it loses fine-grained textual detail during reconstruction. The authors propose the Residual Decoder Adapter (RDA), a post-hoc module that upgrades an existing discrete visual tokenizer without changing its token IDs, so pretrained AR models can use it without retraining. RDA combines a shared-ID hint codebook with a residual decoder that predicts pixel-space corrections on top of the frozen tokenizer output. The method is presented as a plug-and-play way to improve text fidelity while preserving compatibility with existing AR systems and largely maintaining general image quality.

Novelty

The distinctive idea is to improve tokenizer decoding while preserving the original token space, allowing direct reuse of pretrained AR models instead of retraining the tokenizer-AR pipeline. Its shared-ID hint codebook plus residual pixel decoder provides a non-invasive mechanism for recovering high-frequency text details from the same predicted token IDs.
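The mechanism described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes, the names (base_codebook, hint_codebook, residual_w, rda_decode), the use of single linear maps in place of real decoder networks, and the zero initialization of the residual head are all hypothetical choices made for this example. It only shows the structural idea that the same token IDs index both the frozen codebook and a separate hint codebook, and that the adapter adds a pixel-space correction to the frozen decoder output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
VOCAB, DIM, HINT_DIM = 16, 8, 4   # codebook size, base embedding dim, hint embedding dim
GRID, PATCH = 4, 2                # 4x4 token grid; each token decodes to a 2x2 RGB patch

# Frozen tokenizer pieces (stand-ins for a pretrained codebook and decoder).
base_codebook = rng.normal(size=(VOCAB, DIM))
base_decoder_w = rng.normal(size=(DIM, PATCH * PATCH * 3))

# Adapter pieces: a hint codebook indexed by the SAME token IDs, plus a residual head.
hint_codebook = rng.normal(size=(VOCAB, HINT_DIM))
# Assumption: zero init so the adapter starts as a no-op on the frozen output.
residual_w = np.zeros((HINT_DIM, PATCH * PATCH * 3))


def patches_to_image(patches):
    """Rearrange (GRID, GRID, PATCH*PATCH*3) patches into a (H, W, 3) image."""
    p = patches.reshape(GRID, GRID, PATCH, PATCH, 3)
    return p.transpose(0, 2, 1, 3, 4).reshape(GRID * PATCH, GRID * PATCH, 3)


def frozen_decode(token_ids):
    """Decode token IDs with the frozen tokenizer only."""
    return patches_to_image(base_codebook[token_ids] @ base_decoder_w)


def rda_decode(token_ids, residual_weights):
    """Frozen decoder output plus a pixel-space residual driven by hint embeddings."""
    base = frozen_decode(token_ids)
    residual = patches_to_image(hint_codebook[token_ids] @ residual_weights)
    return base + residual


# Token IDs as an AR model would predict them; the adapter never changes them.
token_ids = rng.integers(0, VOCAB, size=(GRID, GRID))
image = rda_decode(token_ids, residual_w)
print(image.shape)
```

Because the token IDs are consumed unchanged, an AR model trained against the original tokenizer needs no modification in this sketch; only the decoding path gains an extra additive term.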

Results

Across multiple general and text-specialized AR models, RDA consistently improves text-rendering metrics, with especially large gains when the AR model is already fine-tuned for text generation. For example, fine-tuned Janus-Pro 1B improves OCR accuracy on StyledTextVisionBlend from 24.52% to 58.26% and on StyledTextSynth from 12.75% to 36.81%, while tokenizer reconstruction metrics also improve on several text-centric datasets. The paper also reports that RDA is more robust out of distribution than directly fine-tuning the decoder, an alternative that substantially worsens ImageNet FID.
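To put the quoted OCR numbers in perspective, the short calculation below derives the absolute and relative gains from the four figures stated above. The helper name gains and the dictionary layout are illustrative only; no values beyond those quoted in this section are used.

```python
# OCR accuracy (%) before and after adding RDA, as quoted in the Results paragraph.
results = {
    "StyledTextVisionBlend": (24.52, 58.26),
    "StyledTextSynth": (12.75, 36.81),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative factor after/before)."""
    return after - before, after / before


summary = {name: gains(b, a) for name, (b, a) in results.items()}
for name, (points, factor) in summary.items():
    print(f"{name}: +{points:.2f} points, {factor:.2f}x")
```

This works out to roughly 33.7 and 24.1 percentage points, or about 2.4x and 2.9x the baseline accuracy, respectively.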

Key Points

  1. RDA enhances text rendering by refining the frozen tokenizer decoder output with residual image corrections rather than altering token IDs.
  2. The method transfers across pretrained AR architectures such as Janus-Pro, TAR, and Lumina-mGPT as a plug-and-play adapter without AR retraining.
  3. Empirical results show consistent gains in OCR-oriented generation and reconstruction benchmarks, with particularly strong improvements on text-specialized models and competitive out-of-distribution (OOD) behavior.

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.