VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
- URL: http://arxiv.org/abs/2512.19680v1
- Date: Mon, 22 Dec 2025 18:54:30 GMT
- Title: VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
- Authors: Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
- Abstract summary: VA-$π$ is a post-training framework to optimize autoregressive visual generation. It unifies pixel reconstruction and autoregressive modeling. It reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, with no direct supervision from the pixel space. We propose VA-$π$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$π$ formulates generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-$π$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The ELBO's regularization term acts as a natural regularizer, maintaining the distributional consistency of the tokens. VA-$π$ enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
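The reinforcement-based alignment described above can be pictured with a toy REINFORCE-style sketch. Everything here (codebook, shapes, the β weight, the frozen reference policy) is an illustrative assumption, not the paper's implementation; it only shows the shape of the objective: a sampled token sequence earns a reward equal to negative pixel reconstruction error, and a KL term keeps the policy close to a reference distribution, mirroring the ELBO's two terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names/shapes are hypothetical): a "generator" emits logits
# over a small codebook at each position, and a frozen "decoder" maps a
# token sequence back to pixels via codebook lookup.
vocab, seq_len, dim = 8, 4, 6
codebook = rng.normal(size=(vocab, dim))                      # stand-in tokenizer codebook
target_pixels = codebook[rng.integers(vocab, size=seq_len)]   # the "clean image"

logits = rng.normal(size=(seq_len, vocab))  # teacher-forced generator outputs
ref_logits = logits.copy()                  # frozen reference policy for the KL term

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(logits)
# Sample a token sequence -- the policy's "action".
tokens = np.array([rng.choice(vocab, p=p) for p in probs])

# Reward: how well the sampled tokens reconstruct the original pixels.
recon = codebook[tokens]
reward = -np.mean((recon - target_pixels) ** 2)

# REINFORCE-style loss with a KL regularizer toward the reference policy,
# echoing the ELBO's reconstruction + distributional-consistency terms.
log_prob = np.sum(np.log(probs[np.arange(seq_len), tokens]))
kl = np.sum(probs * (np.log(probs) - np.log(softmax(ref_logits))))
beta = 0.1  # illustrative regularization weight
loss = -(reward * log_prob) + beta * kl
```

In a real setting the loss would be backpropagated through the generator only; the tokenizer/decoder stays frozen, which is why no reward model or tokenizer retraining is needed.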
Related papers
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy that introduces a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z) - Group Critical-token Policy Optimization for Autoregressive Image Generation [32.472222192052044]
The key obstacle lies in identifying the more critical image tokens during AR generation and applying effective token-wise optimization to them. The critical tokens in RLVR-based AR generation are identified from three perspectives, including: (1) Causal dependency: early tokens fundamentally determine later tokens and the final image due to the unidirectional dependency; (2) Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridge distinct visual regions. Experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of the approach.
arXiv Detail & Related papers (2025-09-26T15:33:18Z) - Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think [63.25744258438214]
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. We propose Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
arXiv Detail & Related papers (2025-07-02T08:29:18Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Next Patch Prediction for Autoregressive Visual Generation [58.73461205369825]
We extend the Next Token Prediction (NTP) paradigm to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. We show that NPP can reduce the training cost to around 0.6x while improving image generation quality by up to 1.0 FID on the ImageNet 256x256 generation benchmark.
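The patch-aggregation idea behind NPP can be sketched as follows. The 2x2 patch size and the simple averaging are illustrative assumptions (the paper's actual aggregation may differ); the point is that grouping a grid of token embeddings into patch tokens shortens the autoregressive sequence, here by 4x.

```python
import numpy as np

# Toy sketch: a 4x4 grid of 3-dim token embeddings (shapes are hypothetical).
h = w = 4
dim = 3
tokens = np.arange(h * w * dim, dtype=float).reshape(h, w, dim)

p = 2  # patch size (assumption)
# Aggregate non-overlapping p x p neighborhoods into one patch token each
# by averaging, raising per-token information density.
patches = tokens.reshape(h // p, p, w // p, p, dim).mean(axis=(1, 3))
seq = patches.reshape(-1, dim)  # (h*w)/(p*p) patch tokens for the AR model
```

The AR model then predicts this shorter patch-token sequence instead of the full token grid, which is where the training-cost reduction comes from.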
arXiv Detail & Related papers (2024-12-19T18:59:36Z) - ZipAR: Parallel Auto-regressive Image Generation through Spatial Locality [19.486745219466666]
ZipAR is a training-free, plug-and-play parallel decoding framework for auto-regressive (AR) visual generation. ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
arXiv Detail & Related papers (2024-12-05T10:57:08Z) - RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Our design enables random orders by inserting a "position instruction token" before each image token to be predicted. RandAR supports inpainting, outpainting, and resolution inference in a zero-shot manner.
arXiv Detail & Related papers (2024-12-02T18:59:53Z) - Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.