Semantic Context Matters: Improving Conditioning for Autoregressive Models
- URL: http://arxiv.org/abs/2511.14063v1
- Date: Tue, 18 Nov 2025 02:42:24 GMT
- Title: Semantic Context Matters: Improving Conditioning for Autoregressive Models
- Authors: Dongyang Jin, Ryan Xu, Jianhao Zeng, Rui Lan, Yancheng Bai, Lei Sun, Xiangxiang Chu,
- Abstract summary: We propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling and Semantic Alignment Guidance. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks.
- Score: 19.768966373880563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.
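The abstract names two mechanisms, Compressed Semantic Prefilling and Semantic Alignment Guidance, without implementation detail. The NumPy sketch below illustrates one plausible reading: chunk-pooling a semantic feature sequence into a short prefix, and a cosine loss that pulls the decoder's last visual hidden states toward target semantics. All function names, the pooling scheme, and the loss form are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def compressed_semantic_prefilling(semantic_feats, k=16):
    """Hypothetical compression: pool a (seq_len, dim) semantic feature
    sequence into a compact k-token prefix by averaging over chunks."""
    seq_len, dim = semantic_feats.shape
    assert seq_len % k == 0, "sketch assumes seq_len divisible by k"
    # (k, seq_len // k, dim) -> mean over each chunk -> (k, dim) prefix
    return semantic_feats.reshape(k, seq_len // k, dim).mean(axis=1)

def semantic_alignment_guidance(last_hidden, target_semantics):
    """Hypothetical alignment loss: mean cosine distance between the
    last visual hidden states and the target semantic vectors."""
    h = last_hidden / np.linalg.norm(last_hidden, axis=-1, keepdims=True)
    t = target_semantics / np.linalg.norm(target_semantics, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(h * t, axis=-1)))
```

In this reading, the prefix is prepended to the decoding context (cheaper than prefilling full vector-quantized tokens), while the alignment loss guides hidden states during decoding; the loss is 0 when hidden states and targets point in the same direction and grows toward 2 as they oppose.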
Related papers
- DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation [108.71044040025374]
We present a novel framework for subject-driven image synthesis built upon a Visual Autoregressive model that employs next-scale prediction. We show that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
arXiv Detail & Related papers (2026-01-30T03:32:29Z) - Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations [53.91818843831925]
We propose NExT-Vid, a novel autoregressive visual generative pretraining framework. We introduce a context-isolated autoregressive predictor to decouple semantic representation from target decoding. Through context-isolated flow-matching pretraining, our approach achieves strong representations.
arXiv Detail & Related papers (2025-12-24T07:07:08Z) - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction [83.50898344094153]
VQRAE produces continuous semantic features for image understanding and discrete tokens for visual generation within a unified tokenizer. This design loses negligible semantic information in the discrete tokens, maintaining the ability of multimodal understanding. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction.
arXiv Detail & Related papers (2025-11-28T17:26:34Z) - Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z) - IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving an FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z) - Growing Visual Generative Capacity for Pre-Trained MLLMs [60.826355079902505]
Bridge is a pure autoregressive unified MLLM that augments pre-trained visual understanding models with generative ability. We propose a semantic-to-pixel discrete representation that integrates compact semantic tokens with fine-grained pixel tokens.
arXiv Detail & Related papers (2025-10-02T00:40:02Z) - EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model [56.53617289548353]
EchoGen is a pioneering framework that empowers Visual Auto-Regressive (VAR) models with subject-driven generation capabilities. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models.
arXiv Detail & Related papers (2025-09-30T11:45:48Z) - CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation [14.820840831692246]
CoAR learns effective, specific subject representations with only a minimal number of parameters. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization.
arXiv Detail & Related papers (2025-08-10T13:36:39Z) - Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations [26.938560887095658]
Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. We propose QTTS, a novel TTS framework built upon our new audio codec, QDAC. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baselines.
arXiv Detail & Related papers (2025-07-16T12:47:09Z) - EAR: Erasing Concepts from Unified Autoregressive Models [3.55166983092355]
We propose Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in AR models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives. We also propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), aiming to provide a more rigorous and comprehensive foundation for evaluating concept erasure in AR models.
arXiv Detail & Related papers (2025-06-25T06:15:07Z) - CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting [53.15827818829865]
Methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies. We propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Our framework explicitly resolves semantic conflicts while preserving category discriminability.
arXiv Detail & Related papers (2025-05-26T19:09:33Z) - Explaining the role of Intrinsic Dimensionality in Adversarial Training [31.495803865226158]
We show that off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval augmented generation setups.
arXiv Detail & Related papers (2024-05-27T12:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.