ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
- URL: http://arxiv.org/abs/2504.00037v1
- Date: Sun, 30 Mar 2025 15:35:24 GMT
- Title: ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
- Authors: Guoyizhe Wei, Rama Chellappa
- Abstract summary: Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. We present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our results underscore the strong potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
- Score: 39.520825264698374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups, particularly for high-resolution tasks, significantly alleviating the hardware challenges of inference. It also elevates the performance of Mamba-based architectures on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the strong potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
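As a rough illustration of how the two objectives described in the abstract could be combined in a training step, here is a minimal PyTorch sketch. The normalized token-affinity matrices used for activation matching, the smooth-L1 masked-prediction loss, the `token_mask` student interface, and the loss weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of the two distillation objectives described in the abstract.
# The teacher/student interfaces, affinity-matrix proxy, losses, and weights are
# assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def activation_matching_loss(student_tokens, teacher_tokens):
    """Align the student's token-wise dependency structure with the teacher's.

    Both inputs are (B, N, D) token features; we compare normalized token-affinity
    (Gram) matrices, one cheap proxy for the teacher's quadratic attention pattern
    that a linear-time student can still be trained to reproduce.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    affinity_s = s @ s.transpose(1, 2)          # (B, N, N)
    affinity_t = t @ t.transpose(1, 2)          # (B, N, N)
    return F.mse_loss(affinity_s, affinity_t)

def masked_prediction_loss(student_pred, teacher_tokens, mask):
    """Student predicts the teacher's features for masked tokens only.

    mask: (B, N) boolean, True where the student's input token was masked.
    """
    return F.smooth_l1_loss(student_pred[mask], teacher_tokens[mask])

def distillation_step(student, teacher, images, mask_ratio=0.5, lam=1.0):
    B = images.size(0)
    with torch.no_grad():
        teacher_tokens = teacher(images)        # (B, N, D), frozen ViT teacher
    N = teacher_tokens.size(1)
    mask = torch.rand(B, N, device=images.device) < mask_ratio
    # Hypothetical student interface: returns token features given images and a mask.
    student_tokens = student(images, token_mask=mask)   # (B, N, D)
    loss = activation_matching_loss(student_tokens, teacher_tokens) \
         + lam * masked_prediction_loss(student_tokens, teacher_tokens, mask)
    return loss
```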
Related papers
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities.
We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details.
We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high-noise levels.
arXiv Detail & Related papers (2025-03-06T03:06:22Z) - big.LITTLE Vision Transformer for Efficient Visual Recognition [34.015778625984055]
big.LITTLE Vision Transformer is an innovative architecture aimed at achieving efficient visual recognition.
The system is composed of two distinct blocks: the big performance block and the LITTLE efficiency block.
When processing an image, the system determines the importance of each token and allocates tokens to the appropriate block.
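To make the big/LITTLE routing idea in the entry above concrete, here is a minimal sketch of one layer that sends high-importance tokens through a heavy block and all tokens through a cheap one. The norm-based importance score, the 50% keep ratio, and the specific sub-modules are assumptions, not the paper's design.

```python
# Illustrative token-routing layer (not the paper's implementation): important tokens
# take the expensive "big" path, the rest keep only the cheap "LITTLE" path.
import torch
import torch.nn as nn

class BigLittleLayer(nn.Module):
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Heavy block (dim must be divisible by nhead) vs. a deliberately cheap block.
        self.big = nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim,
                                              batch_first=True)
        self.little = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, N, D)
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = x.norm(dim=-1)                  # token-importance proxy
        top_idx = scores.topk(k, dim=1).indices  # (B, k) most important tokens
        out = self.little(x)                     # cheap path for every token
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
        big_out = self.big(torch.gather(x, 1, idx))   # expensive path for top tokens
        return out.scatter(1, idx, big_out)      # overwrite important positions
```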
arXiv Detail & Related papers (2024-10-14T08:21:00Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
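As an illustration of the projection idea described in this entry (not the paper's exact algorithm), the sketch below builds an approximate null-space projector from previous tasks' features via SVD and applies it to prompt gradients; the energy threshold and the feature-matrix interface are assumptions.

```python
# Minimal sketch of null-space gradient projection for prompt tuning.
import torch

def null_space_projector(prev_features, energy=0.99):
    """Projector onto the (approximate) null space of previous tasks' features.

    prev_features: (M, D) matrix whose rows are features collected on earlier tasks.
    Keep the top singular directions covering `energy` of the spectrum, then project
    onto their orthogonal complement so updates leave old-task responses unchanged.
    """
    U, S, Vh = torch.linalg.svd(prev_features, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    k = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    V_keep = Vh[:k].T                                   # (D, k) retained subspace
    return torch.eye(prev_features.size(1)) - V_keep @ V_keep.T   # (D, D)

# Usage inside a training step (prompt: (L, D) learnable prompt tokens):
#   P = null_space_projector(prev_features)
#   prompt.grad = prompt.grad @ P   # restrict the update to directions orthogonal
#                                   # to the previous tasks' feature subspace
```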
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR)
ViR has dual parallel and recurrent formulations that strike an optimal balance between fast inference and parallel training while maintaining competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - Unsupervised Object-Centric Learning with Bi-Level Optimized Query Slot Attention [26.25900877220557]
The Slot-Attention module has played an important role thanks to its simple yet effective design and has fostered many powerful variants.
We propose to address these issues by (1) initializing Slot-Attention modules with learnable queries and (2) optimizing the model with bi-level optimization.
Our model achieves state-of-the-art results on both synthetic and complex real-world datasets in unsupervised image segmentation and reconstruction.
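A compact sketch of the first change mentioned in the entry above: replacing random slot initialization with learnable queries inside a standard Slot-Attention update (the bi-level optimization schedule is omitted). Dimensions, iteration count, and the GRU update follow the common Slot-Attention recipe and are not this paper's exact implementation.

```python
# Slot Attention with learnable query initialization (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableQuerySlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots = nn.Parameter(torch.randn(1, num_slots, dim))  # learnable queries
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs):                   # inputs: (B, N, D) image features
        B, N, D = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots.expand(B, -1, -1)     # init from learned queries, not noise
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # compete over slots
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = attn @ v                   # (B, S, D)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
        return slots
```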
arXiv Detail & Related papers (2022-10-17T12:14:59Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z) - Interpretable Neural Computation for Real-World Compositional Visual Question Answering [4.3668650778541895]
We build an interpretable framework for real-world compositional VQA.
In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program runs on them with full transparency to select the attention regions.
Experiments conducted on the GQA benchmark demonstrate that our framework is on par with prior compositional approaches and remains competitive in accuracy with monolithic models.
arXiv Detail & Related papers (2020-10-10T05:46:22Z)