Fugu-MT Paper Translation (Abstract): Self-Corrected Image Generation with Explainable Latent Rewards

Paper summary: Self-Corrected Image Generation with Explainable Latent Rewards

arxiv url: http://arxiv.org/abs/2603.24965v1
Date: Thu, 26 Mar 2026 02:59:35 GMT
Status: translation complete
Last updated in system: 2026-03-27 20:52:48.064785
Title: Self-Corrected Image Generation with Explainable Latent Rewards
Authors: Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He
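Abstract summary: We propose xLARD, a self-correcting framework that guides generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. Experiments show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors.
Reference score (independently computed attention score): 55.29175717238288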
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
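The abstract gives no implementation details, so the following is only a toy sketch of the general idea it names: turning a non-differentiable evaluation into a differentiable, interpretable signal over latent edits. It is not the authors' method. It assumes a three-dimensional latent, a hypothetical quantized scorer (`blackbox_score`) in place of an MLLM judgment, and a local linear surrogate in place of a learned corrector. All names and constants are illustrative.

```python
import random

TARGET = [1.0, -2.0, 0.5]  # hypothetical "well-aligned" latent, used only by the toy scorer


def blackbox_score(latent):
    """Toy stand-in for an image-level evaluation; quantized, so it has no usable gradient."""
    sq_dist = sum((z - t) ** 2 for z, t in zip(latent, TARGET))
    return -round(sq_dist, 1)


def fit_linear_reward(latent, rng, n_samples=200, radius=0.5):
    """Fit reward(edit) ~ w . edit + b around `latent` and return w.

    Uses a per-coordinate covariance/variance estimate, which approximates
    least squares because edit coordinates are sampled independently.
    Each w[i] reads as the reward attributed to editing coordinate i;
    the intercept b is not needed for the gradient and is not computed.
    """
    dim = len(latent)
    edits = [[rng.uniform(-radius, radius) for _ in range(dim)] for _ in range(n_samples)]
    scores = [blackbox_score([z + e for z, e in zip(latent, edit)]) for edit in edits]
    mean_score = sum(scores) / n_samples
    weights = []
    for i in range(dim):
        mean_e = sum(edit[i] for edit in edits) / n_samples
        cov = sum((edit[i] - mean_e) * (s - mean_score) for edit, s in zip(edits, scores))
        var = sum((edit[i] - mean_e) ** 2 for edit in edits)
        weights.append(cov / var)
    return weights


def self_correct(latent, steps=20, step_size=0.25, seed=0):
    """Repeatedly move the latent along the gradient of the fitted surrogate."""
    rng = random.Random(seed)
    latent = list(latent)
    for _ in range(steps):
        weights = fit_linear_reward(latent, rng)  # gradient of w . edit + b w.r.t. edit is w
        latent = [z + step_size * w for z, w in zip(latent, weights)]
    return latent


start = [0.0, 0.0, 0.0]
corrected = self_correct(start)
print(blackbox_score(start), blackbox_score(corrected))
```

The surrogate weights give a per-coordinate account of which latent edits the scorer rewards, which is the sense of "explainable" used in this sketch. In a real system the scorer would be far more expensive to query, so a surrogate learned once and reused would likely replace the per-step refitting shown here.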

Related papers