Fugu-MT Paper Translation (Summary): REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

Paper summary: REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization

arxiv url: http://arxiv.org/abs/2510.04450v1
Date: Mon, 06 Oct 2025 02:48:13 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.663162
Title: REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
Title (reference translation): REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
Authors: Qiyuan He, Yicong Li, Haotian Ye, Jinghao Wang, Xinyao Liao, Pheng-Ann Heng, Stefano Ermon, James Zou, Angela Yao
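Abstract summary: reAR is a simple training strategy that introduces a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
Reference score (attention measure computed independently by this site): 130.46612643194973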
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, yet its performance remains suboptimal compared with diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization ordering. In this work, we identify a core bottleneck from the perspective of generator-tokenizer inconsistency, i.e., the AR-generated tokens may not be well-decoded by the tokenizer. To address this, we propose reAR, a simple training strategy introducing a token-wise regularization objective: when predicting the next token, the causal transformer is also trained to recover the visual embedding of the current token and predict the embedding of the target token under a noisy context. It requires no changes to the tokenizer, generation order, inference pipeline, or external models. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standard rasterization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
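Abstract (reference translation): Visual autoregressive (AR) generation offers a promising path toward unifying vision and language models, but its performance still falls short of diffusion models. Prior work often attributes this gap to tokenizer limitations and rasterization order. This work identifies a core bottleneck from the perspective of generator-tokenizer inconsistency: tokens generated by the AR model may not be decoded well by the tokenizer. To address this, we propose reAR, a simple training strategy that introduces a token-wise regularization objective, in which the causal transformer, while predicting the next token, is also trained to recover the visual embedding of the current token and to predict the embedding of the target token under a noisy context. No changes to the tokenizer, generation order, inference pipeline, or external models are required. Despite its simplicity, reAR substantially improves performance. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 with a standard rasterization-based tokenizer. Applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).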
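The token-wise regularization objective described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It only shows one plausible way to combine a next-token cross-entropy term with two embedding-regression terms (recovering the current token's embedding and predicting the target token's embedding). The function names, the use of mean squared error, the `weight` parameter, and the choice of which transformer features feed the regression terms are all assumptions, since the abstract does not specify them. How the noisy context is constructed is also not modeled here.

```python
import math


def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def mse(pred, ref):
    """Mean squared error between two equal-length vectors."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def regularized_loss(logits, target_id,
                     pred_cur, emb_cur,
                     pred_tgt, emb_tgt,
                     weight=1.0):
    """Hypothetical combined objective for a single token position.

    logits   : next-token scores from the causal transformer
    pred_cur : model's estimate of the current token's visual embedding
    emb_cur  : tokenizer embedding of the current token (regression target)
    pred_tgt : model's estimate of the target token's visual embedding
    emb_tgt  : tokenizer embedding of the target token (regression target)
    weight   : assumed scalar balancing the regularizer against cross-entropy
    """
    ce = cross_entropy(logits, target_id)
    reg = mse(pred_cur, emb_cur) + mse(pred_tgt, emb_tgt)
    return ce + weight * reg
```

When both embedding estimates match their targets exactly, the regularizer vanishes and the loss reduces to ordinary next-token cross-entropy, which is consistent with the abstract's statement that the inference pipeline is unchanged.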
