ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
- URL: http://arxiv.org/abs/2510.14882v1
- Date: Thu, 16 Oct 2025 17:00:59 GMT
- Title: ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
- Authors: Keli Liu, Zhendong Wang, Wengang Zhou, Shaodong Xu, Ruixiao Dong, Houqiang Li
- Abstract summary: ScaleWeaver is a framework designed to achieve high-fidelity, controllable generation upon advanced visual autoregressive (VAR) models. The proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost. Experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods.
- Score: 86.93601565563954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generation with visual autoregressive (VAR) models has recently achieved impressive advances in generation fidelity and inference efficiency. While control mechanisms have been explored for diffusion models, enabling precise and flexible control within the VAR paradigm remains underexplored. To bridge this critical gap, in this paper, we introduce ScaleWeaver, a novel framework designed to achieve high-fidelity, controllable generation upon advanced VAR models through parameter-efficient fine-tuning. The core module in ScaleWeaver is the improved MMDiT block with the proposed Reference Attention module, which efficiently and effectively incorporates conditional information. Unlike MM Attention, the proposed Reference Attention module discards the unnecessary attention from image$\rightarrow$condition, reducing computational cost while stabilizing control injection. In addition, it emphasizes parameter reuse, leveraging the capability of the VAR backbone itself with a few introduced parameters to process control information, and adds a zero-initialized linear projection to ensure that control signals are incorporated effectively without disrupting the generative capability of the base model. Extensive experiments show that ScaleWeaver delivers high-quality generation and precise control while attaining superior efficiency over diffusion-based methods, making ScaleWeaver a practical and effective solution for controllable text-to-image generation within the visual autoregressive paradigm. Code and models will be released.
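The zero-initialized projection mentioned in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the token shapes, and the absence of learned query/key/value projections are all simplifications made here. In particular, the direction of the retained attention (image tokens querying condition tokens, with no condition-side update) is an assumption of this sketch, since the abstract does not spell it out.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention without learned projections."""
    d = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
        for qr in q
    ]
    return matmul([softmax(r) for r in scores], v)


def reference_attention_sketch(image, cond, proj):
    """Hypothetical helper: base image attention plus a projected control term.

    Assumption: only image queries read from the condition tokens; the
    condition tokens are never updated from the image tokens.
    """
    base = attention(image, image, image)   # path of the unmodified base model
    ref = attention(image, cond, cond)      # image queries read the condition
    control = matmul(ref, proj)             # projection gates the control signal
    return [[x + y for x, y in zip(rb, rc)] for rb, rc in zip(base, control)]


def random_tokens(n, d, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
dim = 4
image_tokens = random_tokens(6, dim, rng)   # 6 image tokens (illustrative)
cond_tokens = random_tokens(3, dim, rng)    # 3 condition tokens (illustrative)
zero_proj = [[0.0] * dim for _ in range(dim)]

base_only = attention(image_tokens, image_tokens, image_tokens)
with_control = reference_attention_sketch(image_tokens, cond_tokens, zero_proj)

# With a zero projection the control term vanishes, so the output matches
# the base attention output.
unchanged_at_init = all(
    abs(a - b) < 1e-12
    for ra, rb in zip(base_only, with_control)
    for a, b in zip(ra, rb)
)
print(unchanged_at_init)
```

In this sketch the score matrices have 6x6 and 6x3 entries, compared with 9x9 for joint attention over the concatenated tokens, which is one way to picture the cost reduction the abstract describes. Once the projection moves away from zero during fine-tuning, the control term starts to contribute to the image tokens.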
Related papers
- NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer [14.644014499085943]
NanoControl employs Flux as the backbone network for controllable text-to-image generation. Our model achieves state-of-the-art controllable text-to-image generation performance. It incurs only a 0.024% increase in parameter count and a 0.029% increase in GFLOPs, thus enabling highly efficient controllable generation.
arXiv Detail & Related papers (2025-08-14T07:54:44Z)
- SCALAR: Scale-wise Controllable Visual Autoregressive Learning [15.775596699630633]
We present SCALAR, a controllable generation method based on Visual Autoregressive (VAR) modeling. We leverage a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model.
arXiv Detail & Related papers (2025-07-26T13:23:08Z)
- Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
- CAR: Controllable Autoregressive Modeling for Visual Generation [100.33455832783416]
Controllable AutoRegressive Modeling (CAR) is a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling.
CAR progressively refines and captures control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process.
Our approach demonstrates excellent controllability across various types of conditions and delivers higher image quality compared to previous methods.
arXiv Detail & Related papers (2024-10-07T00:55:42Z)
- ControlVAR: Exploring Controllable Visual Autoregressive Modeling [48.66209303617063]
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs).
Challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs.
This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive modeling for flexible and efficient conditional generation.
arXiv Detail & Related papers (2024-06-14T06:35:33Z)
- FullLoRA: Efficiently Boosting the Robustness of Pretrained Vision Transformers [72.83770102062141]
The Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks. Existing large models tend to prioritize performance during training, potentially neglecting robustness. We develop a novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module. We propose the FullLoRA framework by integrating the learnable LNLoRA modules into all key components of ViT-based models.
arXiv Detail & Related papers (2024-01-03T14:08:39Z)
- Controllability-Constrained Deep Network Models for Enhanced Control of Dynamical Systems [4.948174943314265]
Control of a dynamical system without the knowledge of dynamics is an important and challenging task.
Modern machine learning approaches, such as deep neural networks (DNNs), allow for the estimation of a dynamics model from control inputs and corresponding state observation outputs.
We propose a control-theoretical method that explicitly enhances models estimated from data with controllability.
arXiv Detail & Related papers (2023-11-11T00:04:26Z)
- Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.