ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation
- URL: http://arxiv.org/abs/2511.12893v1
- Date: Mon, 17 Nov 2025 02:28:06 GMT
- Title: ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation
- Authors: Kaixin Zhang, Ruiqing Yang, Yuan Zhang, Shan You, Tao Huang,
- Abstract summary: Existing static pruning methods degrade performance by permanently removing weights or tokens. We propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
- Score: 24.639936266140385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
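The dual-sparsity design described in the abstract lends itself to a compact sketch. The PyTorch module below is a minimal illustration of the two mechanisms: an FFN decomposed into lightweight expert sub-networks with a learnable router, plus a gated selector that sends computation only to high-scoring tokens while the rest pass through unchanged. All names and hyperparameters (ActVARBlock, num_experts, top_k, keep_ratio) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActVARBlock(nn.Module):
    """Illustrative dual-sparsity FFN block (hypothetical, not the paper's code)."""

    def __init__(self, dim, num_experts=4, top_k=2, keep_ratio=0.5):
        super().__init__()
        hidden = (dim * 4) // num_experts            # each expert covers a slice of the full FFN width
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)    # learnable token-to-expert router
        self.token_gate = nn.Linear(dim, 1)          # gated token selector
        self.top_k, self.keep_ratio = top_k, keep_ratio

    def forward(self, x):                            # x: (batch, tokens, dim)
        B, T, D = x.shape
        # 1) Token sparsity: keep only the highest-scoring ("high update potential") tokens.
        scores = self.token_gate(x).squeeze(-1)                        # (B, T)
        keep = max(1, int(T * self.keep_ratio))
        idx = scores.topk(keep, dim=1).indices                         # (B, keep)
        sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))  # selected tokens
        # 2) Weight sparsity: route each selected token to a small subset of experts.
        gate = F.softmax(self.router(sel), dim=-1)                     # (B, keep, num_experts)
        top_w, top_e = gate.topk(self.top_k, dim=-1)
        out = torch.zeros_like(sel)
        for e, expert in enumerate(self.experts):
            w = (top_w * (top_e == e)).sum(-1, keepdim=True)           # weight of expert e per token
            out = out + w * expert(sel)                                # dense here for clarity only
        # 3) Unselected tokens keep their input values, preserving sequence alignment.
        y = x.clone()
        y.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, D), sel + out)
        return y
```

The expert loop is written densely for readability; an efficient implementation would dispatch only the tokens routed to each expert.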
Related papers
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z) - Elastic ViTs from Pretrained Models without Retraining [74.5386166956142]
Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm.
arXiv Detail & Related papers (2025-10-20T16:15:03Z) - Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think [63.25744258438214]
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. We propose Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
arXiv Detail & Related papers (2025-07-02T08:29:18Z) - ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency. We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection. Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
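As a rough illustration of online token-level selection (not ESLM's risk-averse objective), the helper below keeps only the highest-loss fraction of tokens in a batch and averages the loss over those; keep_ratio and the plain top-loss criterion are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def selective_token_loss(logits: torch.Tensor, targets: torch.Tensor, keep_ratio: float = 0.6):
    """logits: (batch, seq, vocab); targets: (batch, seq). Backprop only through selected tokens."""
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), reduction="none"
    ).view_as(targets).float()
    k = max(1, int(per_token.numel() * keep_ratio))
    threshold = per_token.flatten().topk(k).values.min()   # loss cutoff for selection
    mask = (per_token >= threshold).float()
    return (per_token * mask).sum() / mask.sum()           # mean loss over the selected tokens
```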
arXiv Detail & Related papers (2025-05-26T12:23:26Z) - DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs). Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
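A minimal sketch of the token-merging half of this idea, assuming a simple greedy cosine-similarity pairing; DToMe's image-complexity-driven criterion and the virtual unmerging step are not reproduced here.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, merge_ratio: float = 0.3) -> torch.Tensor:
    """tokens: (num_tokens, dim). Greedily average the most similar pairs to shorten the sequence."""
    n, _ = tokens.shape
    r = int(n * merge_ratio)                              # number of merges to perform
    sim = F.cosine_similarity(tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-1.0)                              # ignore self-similarity
    merged = tokens.clone()
    alive = torch.ones(n, dtype=torch.bool)
    for _ in range(r):
        masked = sim.clone()
        masked[~alive] = -1.0                             # exclude already-merged tokens
        masked[:, ~alive] = -1.0
        i, j = divmod(int(masked.argmax()), n)            # most similar surviving pair
        merged[i] = (merged[i] + merged[j]) / 2           # average the pair into token i
        alive[j] = False                                  # (similarities are not recomputed; simplification)
    return merged[alive]                                  # shorter token sequence
```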
arXiv Detail & Related papers (2025-04-23T18:38:18Z) - Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning [4.051777802443125]
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations. We introduce Gradient SAEs, which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function. We find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts.
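For context, a minimal TopK sparse autoencoder is sketched below, with a placeholder hook for the gradient-based weighting the abstract alludes to; the exact way g-SAEs fold gradient information into the TopK selection is an assumption here, not a reproduction of the paper.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Baseline k-sparse autoencoder; grad_weight is a hypothetical g-SAE-style hook."""

    def __init__(self, dim: int, dict_size: int, k: int):
        super().__init__()
        self.enc = nn.Linear(dim, dict_size)
        self.dec = nn.Linear(dict_size, dim, bias=False)
        self.k = k

    def forward(self, x, grad_weight=None):
        pre = torch.relu(self.enc(x))                     # candidate latent activations
        score = pre if grad_weight is None else pre * grad_weight  # assumed form of gradient weighting
        idx = score.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(pre).scatter_(-1, idx, 1.0)
        z = pre * mask                                    # keep only the top-k latents
        return self.dec(z), z                             # reconstruction and sparse code
```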
arXiv Detail & Related papers (2024-11-15T18:03:52Z) - Accurate and Structured Pruning for Efficient Automatic Speech Recognition [23.897482741744117]
We propose a novel compression strategy to reduce the model size and inference cost of the Conformer model.
Our method achieves a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.
arXiv Detail & Related papers (2023-05-31T04:31:16Z) - Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to $2\times$ increase in inference throughput and even greater memory savings.
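A hedged sketch of the general mechanism, assuming a simple learned keep/drop score per cached token; the paper's actual interaction-based scoring is not reproduced, and the class and threshold below are illustrative.

```python
import torch
import torch.nn as nn

class ContextPruner(nn.Module):
    """Learned keep/drop scores for cached key/value pairs (illustrative only)."""

    def __init__(self, dim: int, threshold: float = 0.1):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # learnable "keep" score per cached token
        self.threshold = threshold

    def prune(self, keys: torch.Tensor, values: torch.Tensor):
        # keys, values: (seq_len, dim) for a single sequence's KV cache
        keep = torch.sigmoid(self.scorer(keys)).squeeze(-1) > self.threshold
        return keys[keep], values[keep]   # shorter cache -> cheaper attention, smaller memory

pruner = ContextPruner(dim=64)
k, v = torch.randn(128, 64), torch.randn(128, 64)
k_small, v_small = pruner.prune(k, v)     # attention now runs only over the surviving tokens
```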
arXiv Detail & Related papers (2023-05-25T07:39:41Z) - Learning a Consensus Sub-Network with Polarization Regularization and One Pass Training [2.895034191799291]
Pruning schemes create extra overhead either by iterative training and fine-tuning for static pruning or repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks. Our results on CIFAR-10, CIFAR-100, and Tiny ImageNet suggest that our scheme can remove 50% of connections in deep networks with 1% reduction in classification accuracy.
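One common way to realize a polarization-style regularizer on per-channel scale factors is sketched below: it pulls most scales toward zero while pushing the rest away from the channel mean, so channels separate into prunable and kept groups. Whether this exact form matches the paper's regularizer is an assumption.

```python
import torch

def polarization_penalty(gamma: torch.Tensor, t: float = 1.2) -> torch.Tensor:
    """gamma: per-channel scale factors (e.g. BatchNorm weights) of one layer."""
    # The L1 term drives scales toward zero; subtracting the deviation-from-mean term
    # keeps the surviving scales large, polarizing channels into "prune" vs "keep" groups.
    return t * gamma.abs().sum() - (gamma - gamma.mean()).abs().sum()
```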
arXiv Detail & Related papers (2023-02-17T09:37:17Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
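To make the re-parameterization idea concrete, here is a small sketch of the offline variant: two parallel convolution branches used at training time are folded into one equivalent convolution for inference. OREPA's contribution is performing this squeezing online during training; the merge below only illustrates the underlying equivalence, and the function name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_parallel_convs(conv3: nn.Conv2d, conv1: nn.Conv2d) -> nn.Conv2d:
    """Fold a parallel 1x1 branch into a 3x3 branch (same channels, stride 1, padding 1)."""
    fused = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1)
    w1 = F.pad(conv1.weight.data, (1, 1, 1, 1))       # centre the 1x1 kernel inside a 3x3 kernel
    fused.weight.data = conv3.weight.data + w1
    fused.bias.data = conv3.bias.data + conv1.bias.data
    return fused

# Sanity check: the single fused conv reproduces the sum of the two branches.
x = torch.randn(1, 8, 16, 16)
c3, c1 = nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 1)
assert torch.allclose(merge_parallel_convs(c3, c1)(x), c3(x) + c1(x), atol=1e-5)
```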
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - IB-DRR: Incremental Learning with Information-Back Discrete Representation Replay [4.8666876477091865]
Incremental learning aims to enable machine learning models to continuously acquire new knowledge given new classes.
Saving a subset of training samples of previously seen classes in memory and replaying them during new training phases has proven to be an efficient and effective way to fulfil this aim.
However, finding a trade-off between the model performance and the number of samples to save for each class is still an open problem for replay-based incremental learning.
arXiv Detail & Related papers (2021-04-21T15:32:11Z)