Fugu-MT Paper Translation (Summary): SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

Paper overview: SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

arxiv url: http://arxiv.org/abs/2602.05534v1
Date: Thu, 05 Feb 2026 10:48:58 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.892565
Title: SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
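Title (reference translation): SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Authors: Youngwoo Shin, Jiwan Hur, Junmo Kim
Abstract summary: Visual autoregressive (VAR) models generate images through next-scale prediction. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. This work proposes Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that steers generation toward the intended hierarchy while maintaining global coherence.
Reference score (independently computed attention score): 10.295970926059812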
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
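Abstract (reference translation): Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis that mirrors human perception. In practice, this hierarchy can drift at inference time, because limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. The authors revisit this limitation from an information-theoretic perspective and deduce that the train-inference discrepancy is mitigated when each scale is guaranteed to contribute high-frequency content not explained by earlier scales. Based on this, they propose Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes the target high-frequency signal, defined as the semantic residual, which is isolated from a coarser prior. To obtain this prior, they use a frequency-domain procedure, Discrete Spatial Enhancement (DSE), designed to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models that use discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments show that SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.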
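The abstract describes SSG only at a high level: a coarser prior is built, the residual between the current scale and that prior is isolated, and the residual is emphasized. The toy sketch below illustrates that general idea on a small 2-D grid of values. It is not the authors' implementation. The function names `coarse_prior` and `guide`, the `factor` and `scale` parameters, and the update rule `fine + scale * (fine - prior)` are all assumptions made for illustration, and plain block averaging stands in for the frequency-domain DSE procedure, which the abstract does not specify.

```python
from typing import List

Grid = List[List[float]]


def coarse_prior(fine: Grid, factor: int) -> Grid:
    """Hypothetical stand-in for a coarser prior.

    Averages `fine` over factor x factor tiles and repeats each tile mean
    back to full resolution. This is spatial block averaging, NOT the
    frequency-domain DSE procedure named in the abstract.
    """
    h, w = len(fine), len(fine[0])
    if h % factor or w % factor:
        raise ValueError("grid size must be divisible by factor")
    prior = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            tile = [fine[y][x]
                    for y in range(by, by + factor)
                    for x in range(bx, bx + factor)]
            mean = sum(tile) / len(tile)
            for y in range(by, by + factor):
                for x in range(bx, bx + factor):
                    prior[y][x] = mean
    return prior


def guide(fine: Grid, factor: int, scale: float) -> Grid:
    """Assumed guidance rule: amplify the residual (fine - prior) by `scale`."""
    prior = coarse_prior(fine, factor)
    return [[f + scale * (f - p) for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(fine, prior)]


if __name__ == "__main__":
    grid = [[1.0, 3.0], [5.0, 7.0]]
    print(guide(grid, factor=2, scale=0.5))
```

In this sketch the residual within each tile sums to zero, so the guidance changes fine detail while leaving each tile's mean (the coarse content) unchanged. That mirrors, loosely, the abstract's claim of emphasizing high-frequency content while maintaining global coherence.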

Related papers