Related papers: SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

URL: http://arxiv.org/abs/2602.05534v1
Date: Thu, 05 Feb 2026 10:48:58 GMT
Title: SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Authors: Youngwoo Shin, Jiwan Hur, Junmo Kim
Abstract summary: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence.
Score: 10.295970926059812
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.

Related papers

GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler [54.10960908347221]
We model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen.
arXiv Detail & Related papers (2026-02-15T09:57:47Z)
TIP: Resisting Gradient Inversion via Targeted Interpretable Perturbation in Federated Learning [8.156452885913108]
Federated Learning (FL) facilitates collaborative model training while preserving data locality. The exchange of gradients renders the system vulnerable to Gradient Inversion Attacks (GIAs). We propose Targeted Interpretable Perturbation (TIP), a novel defense framework that integrates model interpretability with frequency domain analysis.
arXiv Detail & Related papers (2026-02-12T06:32:49Z)
HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models [82.10385962490051]
Generative diffusion models show promise for data augmentation. Applying them to fine-grained tasks presents a significant challenge. HiGFA is a hierarchical, confidence-driven orchestration that generates diverse yet faithful synthetic images.
arXiv Detail & Related papers (2025-11-16T10:46:16Z)
Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement [89.99237142387655]
We introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradations. Latent Harmony is a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction. Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
arXiv Detail & Related papers (2025-10-09T08:54:26Z)
STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation [7.2888019138115245]
Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. We introduce Sinusoidal Trainable Activation Functions (STAF). STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning.
arXiv Detail & Related papers (2025-02-02T18:29:33Z)
Self-Guidance: Boosting Flow and Diffusion Generation on Their Own [35.56845917727121]
Self-Guidance (SG) can significantly improve the quality of the generated image by suppressing the generation of low-quality samples. SG relies on the sampling score function of the original diffusion or flow model at different noise levels. We conduct extensive experiments on text-to-image and text-to-video generation with different architectures.
arXiv Detail & Related papers (2024-12-08T06:32:27Z)
MS$^3$D: A RG Flow-Based Regularization for GAN Training with Limited Data [16.574346252357653]
We propose a novel regularization method based on the idea of renormalization group (RG) in physics. We show that our method can effectively enhance the performance and stability of GANs under limited data scenarios.
arXiv Detail & Related papers (2024-08-20T18:37:37Z)
GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy. Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z)
LD-GAN: Low-Dimensional Generative Adversarial Network for Spectral Image Generation with Variance Regularization [72.4394510913927]
Deep learning methods are state-of-the-art for spectral image (SI) computational tasks. GANs enable diverse augmentation by learning and sampling from the data distribution. GAN-based SI generation is challenging because the high dimensionality of this kind of data hinders the convergence of GAN training, leading to suboptimal generation. We propose a statistical regularization to control the low-dimensional representation variance for the autoencoder training and to achieve high diversity of samples generated with the GAN.
arXiv Detail & Related papers (2023-04-29T00:25:02Z)
Generalized Zero-Shot Learning via VAE-Conditioned Generative Flow [83.27681781274406]
Generalized zero-shot learning (GZSL) aims to recognize both seen and unseen classes by transferring knowledge from semantic descriptions to visual representations. Recent generative methods formulate GZSL as a missing data problem, mainly adopting GANs or VAEs to generate visual features for unseen classes. We propose a conditional version of generative flows for GZSL, i.e., VAE-Conditioned Generative Flow (VAE-cFlow).
arXiv Detail & Related papers (2020-09-01T09:12:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.