AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
- URL: http://arxiv.org/abs/2603.00589v2
- Date: Thu, 05 Mar 2026 05:53:46 GMT
- Title: AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
- Authors: Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li, Ji Guo, Wenbo Jiang, Guoqing Wang, Guoming Lu
- Abstract summary: Visual autoregressive models offer stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. But their application remains underexplored and faces two critical challenges: locality-biased attention and residual-only supervision. We propose a globally consistent visual autoregressive framework tailored for image super-resolution.
- Score: 16.90182090355781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales, severely compromising the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
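As a rough illustration of the two components, the sketch below is a simplified, single-resolution NumPy stand-in, not the paper's implementation: the additive-log form of the adaptive mask, the names `corr_mask` and `alpha`, and the same-resolution accumulation in the HCC loss (a real multi-scale pipeline would upsample between scales) are all assumptions.

```python
import numpy as np

def sca_attention(q, k, v, corr_mask, alpha=1.0):
    """Spatial Consistency Autoregression sketch: reweight attention
    toward structurally correlated regions via an adaptive mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (N, N) dot-product scores
    # corr_mask in (0, 1]: higher = more structurally correlated (hypothetical)
    scores = scores + alpha * np.log(corr_mask + 1e-8)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)          # mask-reweighted softmax
    return w @ v                                   # (N, D) attended values

def hcc_loss(residuals, targets):
    """Hierarchical Consistency Constraint sketch: supervise the full
    reconstruction accumulated so far at every scale, not just residuals."""
    total, recon = 0.0, 0.0
    for r, t in zip(residuals, targets):           # coarse-to-fine scales
        recon = recon + r                          # accumulate reconstruction
        total += np.mean((recon - t) ** 2)         # full-reconstruction term
    return total
```

With a uniform `corr_mask` of ones, `sca_attention` reduces to plain scaled dot-product attention; the mask only shifts logits where correlation differs across regions.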
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
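The structure-texture scoring above might be sketched as follows. This is an illustrative stand-in, not StepVAR's actual method: the mean-deviation "high-pass", the SVD-based PCA energy, and the `keep_ratio` parameter are all assumptions.

```python
import numpy as np

def score_tokens(tokens, keep_ratio=0.5, n_components=4):
    """Training-free token scoring sketch: combine a crude high-pass
    texture score with a PCA-based structure score, then keep top tokens."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Texture: deviation from the mean token acts as a simple high-pass proxy
    texture = np.abs(centered).sum(axis=1)
    # Structure: energy of each token in the top principal components
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    structure = np.abs(centered @ vt[:n_components].T).sum(axis=1)
    score = texture + structure
    k = max(1, int(round(len(tokens) * keep_ratio)))
    return np.sort(np.argsort(-score)[:k])         # indices of kept tokens
```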
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution [79.35296000454694]
Diffusion-based super-resolution can synthesize rich details, but models trained on synthetic paired data often fail on real-world LR images. We propose Bird-SR, a reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization. Experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality.
arXiv Detail & Related papers (2026-02-05T19:21:45Z) - HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression [43.90363193188088]
Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB. Real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands.
arXiv Detail & Related papers (2026-01-31T14:30:05Z) - OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion [64.10689934231165]
Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR). Extending them to space-time video super-resolution (STVSR) requires not only recovering realistic visual content from low-resolution inputs but also improving the frame rate with coherent dynamics. We propose OSDEnhancer, the first framework to address real-world STVSR through an efficient one-step diffusion process. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior capability in real-world scenarios.
arXiv Detail & Related papers (2026-01-28T06:59:55Z) - Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement [27.566673104431725]
Segment Anything Models (SAMs) are known for their exceptional zero-shot segmentation performance. However, their performance drops significantly on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. We propose GleSAM++, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images.
arXiv Detail & Related papers (2026-01-05T11:28:58Z) - Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution [75.3690742776891]
We propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS). IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Experiments show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
arXiv Detail & Related papers (2025-12-29T15:09:20Z) - Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization. We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
arXiv Detail & Related papers (2025-12-19T12:56:17Z) - Latent Harmony: Synergistic Unified UHD Image Restoration via Latent Space Regularization and Controllable Refinement [89.99237142387655]
We introduce LH-VAE, which enhances semantic robustness through visual semantic constraints and progressive degradations. Latent Harmony is a two-stage framework that redefines VAEs for UHD restoration by jointly regularizing the latent space and enforcing high-frequency-aware reconstruction. Experiments show Latent Harmony achieves state-of-the-art performance across UHD and standard-resolution tasks, effectively balancing efficiency, perceptual quality, and reconstruction accuracy.
arXiv Detail & Related papers (2025-10-09T08:54:26Z) - Edge-Aware Normalized Attention for Efficient and Detail-Preserving Single Image Super-Resolution [27.3322913419539]
Single-image super-resolution (SISR) remains highly ill-posed because recovering structurally faithful high-frequency content from a single low-resolution observation is ambiguous. Existing edge-aware methods often attach edge priors or attention branches onto increasingly complex backbones, yet ad hoc fusion frequently introduces redundancy, unstable optimization, or limited structural gains. We address this gap with an edge-guided attention mechanism that derives an adaptive modulation map from jointly encoded edge features and intermediate feature activations, then applies it to normalize and reweight responses, selectively amplifying structurally salient regions while suppressing spurious textures.
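The described modulation could be sketched as below. This is a crude stand-in for the paper's learned encoder: the elementwise joint encoding, the sigmoid gate, and the global normalization are all assumptions made for illustration.

```python
import numpy as np

def edge_guided_modulation(feat, edge, eps=1e-6):
    """Edge-guided attention sketch: derive a normalized modulation map
    from an edge prior and feature activations, then reweight responses."""
    # feat: (C, H, W) activations; edge: (H, W) edge-strength prior in [0, 1]
    joint = edge[None, :, :] * feat.mean(axis=0, keepdims=True)  # joint map (crude stand-in)
    m = (joint - joint.mean()) / (joint.std() + eps)             # normalize the map
    gate = 1.0 / (1.0 + np.exp(-m))                              # sigmoid gate in (0, 1)
    return feat * gate                                           # amplify edge-salient regions
```

The gate never zeroes a response outright; it only attenuates regions the edge prior marks as flat, which matches the "reweight rather than fuse" framing above.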
arXiv Detail & Related papers (2025-09-18T02:31:24Z) - GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution [15.563111624900865]
GuideSR is a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID.
arXiv Detail & Related papers (2025-05-01T17:48:25Z) - Robust Single Image Dehazing Based on Consistent and Contrast-Assisted Reconstruction [95.5735805072852]
We propose a novel density-variational learning framework to improve the robustness of the image dehazing model.
Specifically, the dehazing network is optimized under the consistency-regularized framework.
Our method significantly surpasses the state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T08:11:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.