Fugu-MT Paper Translation (Abstract): Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Paper overview: Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

arxiv url: http://arxiv.org/abs/2606.01911v1
Date: Mon, 01 Jun 2026 08:47:17 GMT
Status: Translation complete
Last updated in system: 2026-06-03 00:57:58.948932
Title: Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering
Authors: Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan
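Abstract summary: Visual autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. The authors propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer without changing its token space.
Reference score (attention score computed independently by this site): 92.43552212966732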
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (the residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, the OCR accuracy of finetuned Janus-Pro rises from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA
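The residual design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the sizes, the linear "decoder", the zero-initialised residual head, and the least-squares fit are all assumptions chosen to keep the example small and runnable with NumPy alone. It shows only the general idea that a frozen decoder and a parallel branch are driven by the same token IDs, and that the branch is fitted to the pixel-space gap between the frozen reconstruction and the target.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, PIXELS = 16, 8, 12  # hypothetical toy sizes, not from the paper

# Frozen parts of the original tokenizer (stand-ins: random codebook, linear decoder).
base_codebook = rng.normal(size=(VOCAB, DIM))
base_decoder_w = rng.normal(size=(DIM, PIXELS))

# Adapter parts: a paired codebook with the same number of entries, so it is
# indexed by the SAME token IDs, plus a residual head. Zero-initialising the
# head is an assumption made here so that the adapter starts as a no-op.
paired_codebook = rng.normal(size=(VOCAB, DIM))
residual_w = np.zeros((DIM, PIXELS))


def decode_base(token_ids):
    """Frozen path: token IDs -> original codebook -> original decoder."""
    return base_codebook[token_ids] @ base_decoder_w


def decode_with_adapter(token_ids, residual_weights):
    """Frozen output plus a pixel-space correction from the parallel branch."""
    residual = paired_codebook[token_ids] @ residual_weights
    return decode_base(token_ids) + residual


def fit_residual(token_ids, target):
    """Fit only the residual head to (target - frozen output) by least squares."""
    gap = target - decode_base(token_ids)
    weights, *_ = np.linalg.lstsq(paired_codebook[token_ids], gap, rcond=None)
    return weights


# Toy data: 32 tokens and a random "ground-truth" pixel target per token.
token_ids = rng.integers(0, VOCAB, size=32)
target = rng.normal(size=(32, PIXELS))

fitted_w = fit_residual(token_ids, target)
err_before = np.mean((target - decode_with_adapter(token_ids, residual_w)) ** 2)
err_after = np.mean((target - decode_with_adapter(token_ids, fitted_w)) ** 2)
print(f"mean squared error before: {err_before:.4f}, after: {err_after:.4f}")
```

Because the token IDs and the frozen path are untouched, an AR model trained against the original tokenizer could keep emitting the same IDs; only the decoding side changes in this sketch.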

Related papers