Progressive Supernet Training for Efficient Visual Autoregressive Modeling
- URL: http://arxiv.org/abs/2511.16546v1
- Date: Thu, 20 Nov 2025 16:59:24 GMT
- Title: Progressive Supernet Training for Efficient Visual Autoregressive Modeling
- Authors: Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, Mingbao Lin,
- Abstract summary: We propose a training strategy that breaks through the frontier of generation quality for both paradigms and the full network.<n>Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30, VARiant-d16 and VARiant-d8 achieve nearly equivalent quality.<n> VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost.
- Score: 56.15415456746672
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize subnet. Subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
Related papers
- PSE-Net: Channel Pruning for Convolutional Neural Networks with Parallel-subnets Estimator [16.698190973547362]
We introduce PSE-Net, a novel parallel-subnets estimator for efficient channel pruning.
Our proposed algorithm facilitates the efficiency of supernet training.
We develop a prior-distributed-based sampling algorithm to boost the performance of classical evolutionary search.
arXiv Detail & Related papers (2024-08-29T03:20:43Z) - Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks.
Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
arXiv Detail & Related papers (2024-04-02T17:58:49Z) - Lightweight and Progressively-Scalable Networks for Semantic
Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net) that novelly expands the network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z) - Compact Multi-level Sparse Neural Networks with Input Independent
Dynamic Rerouting [33.35713740886292]
Sparse deep neural networks can substantially reduce the complexity and memory consumption of the models.
Facing the real-life challenges, we propose to train a sparse model that supports multiple sparse levels.
In this way, one can dynamically select the appropriate sparsity level during inference, while the storage cost is capped by the least sparse sub-model.
arXiv Detail & Related papers (2021-12-21T01:35:51Z) - DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and
Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slice a part of network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z) - Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net)
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z) - Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z) - Fully Dynamic Inference with Deep Neural Networks [19.833242253397206]
Two compact networks, called Layer-Net (L-Net) and Channel-Net (C-Net), predict on a per-instance basis which layers or filters/channels are redundant and therefore should be skipped.
On the CIFAR-10 dataset, LC-Net results in up to 11.9$times$ fewer floating-point operations (FLOPs) and up to 3.3% higher accuracy compared to other dynamic inference methods.
On the ImageNet dataset, LC-Net achieves up to 1.4$times$ fewer FLOPs and up to 4.6% higher Top-1 accuracy than the other methods.
arXiv Detail & Related papers (2020-07-29T23:17:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.