Related papers: Efficient Conditional Generation on Scale-based Visual Autoregressive Models

URL: http://arxiv.org/abs/2510.05610v1
Date: Tue, 07 Oct 2025 06:27:03 GMT
Title: Efficient Conditional Generation on Scale-based Visual Autoregressive Models
Authors: Jiaqi Liu, Tao Huang, Chang Xu
Abstract summary: Efficient Control Model (ECM) is a plug-and-play framework featuring a lightweight control module that introduces control signals via a distributed architecture. ECM combines context-aware attention layers, which refine conditional features using real-time generated tokens, with a shared gated feed-forward network (FFN) designed to maximize the utilization of its limited capacity. Our method achieves high-fidelity and diverse control over image generation, surpassing existing baselines while significantly improving both training and inference efficiency.
Score: 26.81493253536486
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in autoregressive (AR) models have demonstrated their potential to rival diffusion models in image synthesis. However, for complex spatially-conditioned generation, current AR approaches rely on fine-tuning the pre-trained model, leading to significant training costs. In this paper, we propose the Efficient Control Model (ECM), a plug-and-play framework featuring a lightweight control module that introduces control signals via a distributed architecture. This architecture consists of context-aware attention layers that refine conditional features using real-time generated tokens, and a shared gated feed-forward network (FFN) designed to maximize the utilization of its limited capacity and ensure coherent control feature learning. Furthermore, recognizing the critical role of early-stage generation in determining semantic structure, we introduce an early-centric sampling strategy that prioritizes learning early control sequences. This approach reduces computational cost by lowering the number of training tokens per iteration, while a complementary temperature scheduling during inference compensates for the resulting insufficient training of late-stage tokens. Extensive experiments on scale-based AR models validate that our method achieves high-fidelity and diverse control over image generation, surpassing existing baselines while significantly improving both training and inference efficiency.
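
Illustrative sketch (not from the paper): the abstract describes an early-centric sampling strategy that lowers the number of training tokens per iteration, plus a temperature schedule at inference, but it does not give the sampling distribution or the schedule. The Python below is a minimal sketch under stated assumptions: the scale side lengths in `SCALES`, the geometric weighting in `sample_cutoff`, and the linear, decreasing schedule in `temperature_for_scale` are all hypothetical choices made here for illustration, and every function name is invented.

```python
import random

# Hypothetical side lengths of the token maps at each scale (coarse to fine).
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def tokens_per_scale(scales):
    """Token count at each scale, assuming square token maps."""
    return [s * s for s in scales]


def sample_cutoff(num_scales, decay=0.7, rng=random):
    """Draw how many leading scales to train on in one iteration.

    Assumption: cutoff k gets weight decay**(k - 1), so short prefixes
    (early scales only) are drawn most often.
    """
    weights = [decay ** k for k in range(num_scales)]
    return rng.choices(range(1, num_scales + 1), weights=weights, k=1)[0]


def training_tokens(scales, cutoff):
    """Tokens processed when training on the first `cutoff` scales."""
    return sum(tokens_per_scale(scales)[:cutoff])


def temperature_for_scale(index, num_scales, t_start=1.0, t_end=0.5):
    """Assumed linear schedule from t_start (first scale) to t_end (last)."""
    if num_scales == 1:
        return t_start
    frac = index / (num_scales - 1)
    return t_start + frac * (t_end - t_start)


if __name__ == "__main__":
    rng = random.Random(0)
    cutoff = sample_cutoff(len(SCALES), rng=rng)
    print("scales trained this iteration:", cutoff)
    print("tokens this iteration:", training_tokens(SCALES, cutoff))
    print("tokens for all scales:", training_tokens(SCALES, len(SCALES)))
    temps = [round(temperature_for_scale(i, len(SCALES)), 2)
             for i in range(len(SCALES))]
    print("per-scale temperatures:", temps)
```

Because the token count grows roughly quadratically with scale, truncating training to an early prefix removes most of the tokens in an iteration, which is the cost reduction the abstract refers to. The direction of the temperature schedule (lower at late scales) is a guess; the paper may schedule it differently.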

Related papers

StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning [61.380634253724594]
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL). We show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model.
arXiv Detail & Related papers (2025-12-23T18:51:50Z)
ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention [86.93601565563954]
ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image→condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
arXiv Detail & Related papers (2025-10-16T17:00:59Z)
Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient masked skeleton modeling. Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z)
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
KDC-Diff: A Latent-Aware Diffusion Model with Knowledge Retention for Memory-Efficient Image Generation [2.0250638970950905]
KDC-Diff is a novel and scalable generative framework designed to significantly reduce computational overhead while maintaining high performance. Our model demonstrates strong performance across FID, CLIP, KID, and LPIPS metrics while achieving substantial reductions in parameter count, inference time, and FLOPs.
arXiv Detail & Related papers (2025-05-11T14:40:51Z)
Large EEG-U-Transformer for Time-Step Level Detection Without Pre-Training [1.3254304182988286]
We propose a simple U-shaped model to efficiently learn representations by capturing both local and global features. Compared to other window-level classification models, our method directly outputs predictions at the time-step level. Our model won 1st place in the 2025 "seizure detection challenge" organized at the International Conference on Artificial Intelligence in Epilepsy and Other Neurological Disorders.
arXiv Detail & Related papers (2025-04-01T01:33:42Z)
GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks [24.323017830938394]
This work aims to address challenges by introducing a pre-training framework that seamlessly integrates with baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. These modules are specifically designed to capture intra-temporal customized representations and semantic- and inter-cluster relationships.
arXiv Detail & Related papers (2023-11-07T02:36:24Z)
Training dynamic models using early exits for automatic speech recognition on resource-constrained devices [15.879328412777008]
Early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands. We show that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models. Results provide insights into the training dynamics of early-exit architectures for ASR models.
arXiv Detail & Related papers (2023-09-18T07:45:16Z)
Exploiting Diffusion Prior for Real-World Image Super-Resolution [75.5898357277047]
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution. By employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model.
arXiv Detail & Related papers (2023-05-11T17:55:25Z)
RLFlow: Optimising Neural Network Subgraph Transformation with World Models [0.0]
We propose a model-based agent which learns to optimise the architecture of neural networks by performing a sequence of subgraph transformations to reduce model runtime. We show our approach can match the performance of state of the art on common convolutional networks and outperform those by up to 5% on transformer-style architectures.
arXiv Detail & Related papers (2022-05-03T11:52:54Z)
Normalizing Flows with Multi-Scale Autoregressive Priors [131.895570212956]
We introduce channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. We show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
arXiv Detail & Related papers (2020-04-08T09:07:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.