Fugu-MT Paper Translation (Abstract): Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

Paper abstract: Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards

arxiv url: http://arxiv.org/abs/2603.23086v1
Date: Tue, 24 Mar 2026 11:28:36 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.453458
Title: Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards
Authors: Orhun Buğra Baran, Melih Kandemir, Ramazan Gokberk Cinbis
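Abstract summary: Autoregressive (AR) models are highly effective for image generation, but standard maximum-likelihood training does not directly optimize sample quality and diversity. This paper proposes a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Its core contribution is a novel distribution-level Leave-One-Out FID (LOO-FID) reward.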
Reference score (independently computed attention score): 16.135177543347773
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, we propose a lightweight RL framework that casts token-based AR synthesis as a Markov Decision Process, optimized via Group Relative Policy Optimization (GRPO). Our core contribution is the introduction of a novel distribution-level Leave-One-Out FID (LOO-FID) reward; by leveraging an exponential moving average of feature moments, it explicitly encourages sample diversity and prevents mode collapse during policy updates. We integrate this with composite instance-level rewards (CLIP and HPSv2) for strict semantic and perceptual fidelity, and stabilize the multi-objective learning with an adaptive entropy regularization term. Extensive experiments on LlamaGen and VQGAN architectures demonstrate clear improvements across standard quality and diversity metrics within only a few hundred tuning iterations. The results also show that the model can be updated to produce competitive samples even without Classifier-Free Guidance, and bypass its 2x inference cost.
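The abstract names two ingredients without giving formulas: a leave-one-out FID reward and GRPO's group-relative advantages. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the leave-one-out reward for sample i is the Fréchet distance computed without sample i minus the distance computed with all samples, so a sample that pulls the batch closer to the reference gets a positive reward. The function names (`frechet_distance`, `loo_fid_rewards`, `group_relative_advantages`) are hypothetical, moments are taken from a single batch rather than the exponential moving average the abstract mentions, and features are plain NumPy arrays standing in for Inception-style embeddings.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians, the quantity underlying FID."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov1 @ cov2, which are real and non-negative for
    # positive semi-definite inputs (clipping guards against round-off).
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def moments(feats):
    """Mean vector and covariance matrix of a (n_samples, dim) feature array."""
    return feats.mean(axis=0), np.cov(feats, rowvar=False)

def loo_fid_rewards(feats, ref_mu, ref_cov):
    """Assumed leave-one-out reward: distance without sample i minus distance with all."""
    n = feats.shape[0]
    full = frechet_distance(*moments(feats), ref_mu, ref_cov)
    rewards = np.empty(n)
    for i in range(n):
        rest = np.delete(feats, i, axis=0)
        rewards[i] = frechet_distance(*moments(rest), ref_mu, ref_cov) - full
    return rewards

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Usage: one group of 8 generated samples with 4-dimensional stand-in features.
rng = np.random.default_rng(0)
group_feats = rng.normal(size=(8, 4))
rewards = loo_fid_rewards(group_feats, np.zeros(4), np.eye(4))
advantages = group_relative_advantages(rewards)
```

In this sketch the distribution-level reward is turned into a per-sample signal before standardization, which is what lets it be combined with instance-level rewards such as CLIP or HPSv2 scores inside one group. How the paper weights or combines these terms is not stated in the abstract.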


Related papers