FVAR: Visual Autoregressive Modeling via Next Focus Prediction
- URL: http://arxiv.org/abs/2511.18838v1
- Date: Mon, 24 Nov 2025 07:19:04 GMT
- Title: FVAR: Visual Autoregressive Modeling via Next Focus Prediction
- Authors: Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, Dingkang Liang,
- Abstract summary: We present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling. 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations.
- Score: 35.70387954364497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual autoregressive models achieve remarkable generation quality through next-scale predictions across multi-scale token pyramids. However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to effectively incorporate alias information during training while maintaining deployment simplicity. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge to a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance with perfect compatibility to existing VAR frameworks.
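The blur-to-clarity pyramid described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the disk ("pillbox") kernel is only one common idealized defocus PSF, and the function names (`disk_psf`, `blur`, `refocus_pyramid`, `high_freq_energy`), the radius schedule, the wrap-around borders, and the choice to keep every view at full resolution are all assumptions made for illustration.

```python
import numpy as np

def disk_psf(radius: float) -> np.ndarray:
    """Normalized disk kernel, an idealized defocus PSF (assumed here)."""
    if radius <= 0:
        return np.ones((1, 1))  # radius 0 acts as the identity (in focus)
    half = int(np.ceil(radius))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = (x**2 + y**2 <= radius**2).astype(float)
    return kernel / kernel.sum()

def blur(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution via FFT; wrap-around borders keep the sketch short."""
    h, w = image.shape
    kh, kw = kernel.shape
    padded = np.zeros((h, w))
    padded[:kh, :kw] = kernel
    # Move the kernel centre to the origin so the result is not shifted.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def refocus_pyramid(image, radii=(8.0, 4.0, 2.0, 1.0, 0.0)):
    """Low-pass views ordered from most defocused to sharp (hypothetical radii)."""
    return [blur(image, disk_psf(r)) for r in radii]

def high_freq_energy(view: np.ndarray) -> float:
    """Mean squared horizontal difference, a crude proxy for fine detail."""
    return float(np.mean(np.diff(view, axis=1) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64))  # stand-in for a grayscale image
    for radius, view in zip((8.0, 4.0, 2.0, 1.0, 0.0), refocus_pyramid(image)):
        print(radius, high_freq_energy(view))
```

Each successive view admits more high-frequency content than the one before it, which is the sense in which the sequence moves from blur to clarity. Because the kernel is normalized, every view also keeps the same mean brightness as the input.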
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation [108.71044040025374]
We present a novel framework for subject-driven image synthesis built upon a Visual Autoregressive model that employs next-scale prediction. We show that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
arXiv Detail & Related papers (2026-01-30T03:32:29Z)
- Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration [3.5382753486225087]
We propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections. Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp
arXiv Detail & Related papers (2025-08-05T11:31:08Z)
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model [92.61216319417208]
We propose a novel diffusion model (DM)-based framework for image deblurring. The framework uses the DM to generate prior knowledge that aids in recovering the textures of blurry images. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer).
arXiv Detail & Related papers (2025-07-18T01:50:31Z)
- MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior [11.753823187605033]
This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques.
arXiv Detail & Related papers (2025-06-19T00:15:07Z)
- SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors [22.561786156613525]
We propose SparseGS-W, a novel framework for synthesizing novel views of large-scale scenes from unconstrained in-the-wild images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.
arXiv Detail & Related papers (2025-03-25T08:40:40Z)
- Unpaired Deblurring via Decoupled Diffusion Model [55.21345354747609]
We propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains. We employ two Q-Formers separately as extractors of structural features and blur patterns. The extracted features are used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation.
arXiv Detail & Related papers (2025-02-03T17:00:40Z)
- NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning [63.39461847093663]
We propose NeRF-VPT, an innovative method for novel view synthesis to address these challenges.
Our proposed NeRF-VPT employs a cascading view prompt tuning paradigm, wherein RGB information gained from preceding rendering outcomes serves as instructive visual prompts for subsequent rendering stages.
NeRF-VPT only requires sampling RGB data from previous stage renderings as priors at each training stage, without relying on extra guidance or complex techniques.
arXiv Detail & Related papers (2024-03-02T22:08:10Z)
- Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z)
- DreamDrone: Text-to-Image Diffusion Models are Zero-shot Perpetual View Generators [56.994967294931286]
We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating flythrough scenes from textual prompts.
We advocate explicitly warping the intermediate latent code of the pre-trained text-to-image diffusion model for high-quality image generation and unbounded generalization ability.
arXiv Detail & Related papers (2023-12-14T08:42:26Z)
- HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models [56.112302700630806]
We introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation.
Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations.
We extend our method to a novel image editing task: substituting the subject in an image through textual manipulations.
arXiv Detail & Related papers (2023-11-30T02:33:29Z)
- Panini-Net: GAN Prior Based Degradation-Aware Feature Interpolation for Face Restoration [4.244692655670362]
Panini-Net is a degradation-aware feature network for face restoration.
It learns the abstract representations to distinguish various degradations.
It achieves state-of-the-art performance for multi-degradation face restoration and face super-resolution.
arXiv Detail & Related papers (2022-03-16T07:41:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.