Fugu-MT Paper Translation (Abstract): Visual Autoregressive Modeling for Instruction-Guided Image Editing

Paper summary: Visual Autoregressive Modeling for Instruction-Guided Image Editing

arxiv url: http://arxiv.org/abs/2508.15772v1
Date: Thu, 21 Aug 2025 17:59:32 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.445426
Title: Visual Autoregressive Modeling for Instruction-Guided Image Editing
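Title (reference translation): Visual Autoregressive Modeling for Instruction-Guided Image Editing
Authors: Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei
Abstract summary: We propose a visual autoregressive framework that reframes image editing as a next-scale prediction problem. VAREdit generates multi-scale target features to achieve precise edits. It completes a $512\times512$ edit in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit.
Reference score (independently computed attention score): 97.04821896251681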
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition on the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30\% higher. Moreover, it completes a $512\times512$ edit in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
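Abstract (reference translation): Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently ties the edited region to the context of the whole image, which causes unintended spurious modifications and weakens adherence to the editing instruction. Autoregressive models, by contrast, offer a different paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally avoids the adherence difficulties of diffusion-based methods. This paper presents VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to condition effectively on the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit shows significant advances in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a GPT-Balance score more than 30\% higher. It also completes a $512\times512$ edit in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.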
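
The abstract says that finest-scale source features guide coarser target scales poorly, and that the SAR module supplies scale-matched conditioning instead. It does not say how the scale-matched features are obtained. The sketch below is only an illustration of the idea of scale matching, under the assumption that a coarser reference can be produced by average-pooling the finest-scale source map. The names `average_pool` and `scale_matched_references` are hypothetical and are not taken from the VAREdit code.

```python
def average_pool(feature_map, out_size):
    """Downsample a square 2D feature map to out_size x out_size by block averaging.

    Assumption for illustration: pooling is how a coarser reference is derived.
    """
    in_size = len(feature_map)
    if in_size % out_size != 0:
        raise ValueError("out_size must divide the input size")
    block = in_size // out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Collect the block of fine-scale cells that maps onto coarse cell (i, j).
            cells = [
                feature_map[i * block + di][j * block + dj]
                for di in range(block)
                for dj in range(block)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def scale_matched_references(finest_source, scales):
    """Return one source reference per target scale, keyed by its side length."""
    return {s: average_pool(finest_source, s) for s in scales}


# Toy single-channel 4x4 "finest-scale" source map with values 0..15.
finest = [[float(4 * r + c) for c in range(4)] for r in range(4)]
refs = scale_matched_references(finest, scales=[1, 2, 4])
```

Here each target scale (1x1, 2x2, 4x4) is paired with a source reference of the same resolution, rather than every scale being conditioned on the 4x4 map. Real features would be multi-channel tensors, and the actual SAR module may derive its references differently.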

Related papers