Decorrelation Speeds Up Vision Transformers
- URL: http://arxiv.org/abs/2510.14657v1
- Date: Thu, 16 Oct 2025 13:13:12 GMT
- Title: Decorrelation Speeds Up Vision Transformers
- Authors: Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven
- Abstract summary: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs. We address this by integrating Decorrelated Backpropagation (DBP), an optimization method that iteratively reduces input correlations at each layer to accelerate convergence, into MAE pre-training. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP), an optimization method that iteratively reduces input correlations at each layer to accelerate convergence, into MAE pre-training. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. On ImageNet-1K pre-training with ADE20K fine-tuning, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
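To make the mechanism concrete, below is a minimal sketch of a decorrelation layer in the spirit of DBP. It assumes a simple local update rule that drives the off-diagonal covariance of a layer's inputs toward zero; the exact update rule, learning rate, and placement are assumptions of this sketch and may differ from the paper's formulation.

```python
# Hypothetical sketch of a DBP-style decorrelation layer (names and update
# rule are illustrative assumptions, not the paper's exact method).
import torch
import torch.nn as nn

class Decorrelator(nn.Module):
    """Maintains a matrix R nudged toward whitening its input stream."""
    def __init__(self, dim: int, lr: float = 1e-4):
        super().__init__()
        self.register_buffer("R", torch.eye(dim))  # decorrelation matrix
        self.lr = lr

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); decorrelate before the wrapped layer consumes it.
        z = x @ self.R.T
        if self.training:
            with torch.no_grad():
                cov = (z.T @ z) / z.shape[0]
                off_diag = cov - torch.diag(torch.diag(cov))
                # Local rule: shrink remaining correlations in z.
                self.R -= self.lr * off_diag @ self.R
        return z

# Each wrapped sublayer then sees (approximately) decorrelated inputs.
block = nn.Sequential(Decorrelator(768), nn.Linear(768, 768), nn.GELU())
out = block(torch.randn(32, 768))
```

Per the abstract, such layers are applied selectively to the encoder only, which is the configuration the authors report as both faster and stable.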
Related papers
- Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP [60.025820738301434]
TuneCLIP is a self-supervised fine-tuning framework for CLIP models. It consistently improves performance across model architectures and scales, elevating leading open-weight models such as SigLIP (ViT-B/16) with gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks.
arXiv Detail & Related papers (2026-01-14T20:38:36Z) - MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), achieving a 1-step FID of 2.03 (outperforming vanilla MF's 3.43) while reducing sampling GFLOPs by 38% and total training cost by 83% on ImageNet 256.
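As a rough illustration of why MF needs so few steps, here is a hedged sketch of a sampling loop, assuming a trained network `u_net(x, r, t)` that predicts the average velocity over the interval [r, t]; the RAE latent-space pipeline (encoding, decoding, conditioning) is omitted.

```python
# Sketch of MeanFlow-style few-step sampling: one long jump per interval,
# x_r = x_t - (t - r) * u(x_t, r, t), instead of many small ODE steps.
# `u_net` is a stand-in for the trained model (an assumption of this sketch).
import torch

def meanflow_sample(u_net, shape, steps: int = 1):
    x = torch.randn(shape)                    # t = 1: pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t, r in zip(ts[:-1], ts[1:]):
        x = x - (t - r) * u_net(x, r, t)      # averaged-velocity jump
    return x

u_net = lambda x, r, t: x                     # dummy network for the demo
sample = meanflow_sample(u_net, (2, 3, 8, 8), steps=1)
```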
arXiv Detail & Related papers (2025-11-17T06:17:08Z) - EcoSpa: Efficient Transformer Training with Coupled Sparsity [79.5008521101473]
Transformers have become the backbone of modern AI, yet their high computational demands pose critical system challenges. We introduce EcoSpa, an efficient structured sparse training method that jointly evaluates and sparsifies coupled weight-matrix pairs.
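The "coupled" pairing is easiest to see in a transformer MLP, where hidden unit i ties row i of the up-projection to column i of the down-projection. The sketch below prunes both together under a simple joint-norm score; EcoSpa's actual importance metric and training schedule are not specified here and may differ.

```python
# Illustrative coupled structured pruning of an MLP pair (the scoring rule is
# an assumption; only the row/column pairing reflects the coupled idea).
import torch
import torch.nn as nn

def prune_coupled_mlp(up: nn.Linear, down: nn.Linear, keep_ratio: float = 0.5):
    # Score each hidden unit by the joint norm of its coupled weights.
    scores = up.weight.norm(dim=1) * down.weight.norm(dim=0)
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices.sort().values
    # Rebuild both layers with the surviving units, preserving the pairing.
    new_up = nn.Linear(up.in_features, k)
    new_down = nn.Linear(k, down.out_features)
    new_up.weight.data = up.weight.data[keep]
    new_up.bias.data = up.bias.data[keep]
    new_down.weight.data = down.weight.data[:, keep]
    new_down.bias.data = down.bias.data.clone()
    return new_up, new_down

up, down = prune_coupled_mlp(nn.Linear(768, 3072), nn.Linear(3072, 768))
```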
arXiv Detail & Related papers (2025-11-09T11:23:43Z) - Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers [91.02299679350834]
Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. We present Sprint (Sparse-Dense Residual Fusion for Efficient Diffusion Transformers), a simple method that enables aggressive token dropping (up to 75%) while preserving quality.
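The token-dropping half of the idea is simple to sketch: keep a random subset of tokens per sample and remember their indices so sparse outputs can later be fused back into a dense residual stream (the fusion path itself is Sprint-specific and omitted here).

```python
# Generic random token dropping for transformer training (the sparse branch
# only; Sprint's dense branch and residual fusion are not shown).
import torch

def drop_tokens(x: torch.Tensor, keep_ratio: float = 0.25):
    # x: (batch, tokens, dim); keep a random k tokens per sample.
    b, n, d = x.shape
    k = max(1, int(keep_ratio * n))
    idx = torch.rand(b, n).argsort(dim=1)[:, :k]
    kept = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, idx  # idx lets the caller scatter results back later

kept, idx = drop_tokens(torch.randn(8, 256, 768))  # keeps 64 of 256 tokens
```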
arXiv Detail & Related papers (2025-10-24T19:29:55Z) - IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method [59.02943805284446]
We propose the Iterative Implicit Euler Transformer (IIET). IIAD allows users to effectively balance the performance-efficiency trade-off. The E-IIET variant achieves an average performance gain exceeding 1.6% over the vanilla Transformer at comparable speed.
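The implicit-Euler view admits a compact sketch: a residual block x + f(x) is one explicit Euler step, and the implicit update x_{k+1} = x_k + f(x_{k+1}) can be approximated by a few fixed-point iterations. The snippet below is a generic illustration under that reading, not IIET's exact solver.

```python
# Fixed-point approximation of an implicit Euler residual update
# (iteration count and solver are assumptions of this sketch).
import torch
import torch.nn as nn

def implicit_euler_block(f: nn.Module, x: torch.Tensor, iters: int = 3):
    y = x + f(x)              # explicit Euler step as the initial guess
    for _ in range(iters):    # refine toward y = x + f(y)
        y = x + f(y)
    return y

f = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
out = implicit_euler_block(f, torch.randn(4, 64))
```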
arXiv Detail & Related papers (2025-09-26T15:14:03Z) - Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference [47.027290803102666]
We introduce a sequential training approach that aligns branch training with inference-time data distributions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training. These results offer practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
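A hedged sketch of the alignment idea: each exit branch trains only on the samples that earlier exits would actually forward at inference time, here using an entropy threshold as a stand-in confidence criterion (BTS-EE's exact routing rule and losses may differ).

```python
# Selecting the training subset for a later exit branch based on the
# uncertainty of an earlier one (threshold and criterion are assumptions).
import torch

def entropy(logits: torch.Tensor) -> torch.Tensor:
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-9).log()).sum(dim=-1)

def forwarded_mask(exit_logits: torch.Tensor, threshold: float = 1.0):
    # True where the earlier exit is unsure -> sample flows downstream.
    return entropy(exit_logits) > threshold

logits1 = torch.randn(32, 10)       # predictions from exit branch 1
mask = forwarded_mask(logits1)      # train branch 2 on these samples only
```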
arXiv Detail & Related papers (2025-09-10T06:47:49Z) - ElasticZO: A Memory-Efficient On-Device Learning with Combined Zeroth- and First-Order Optimization [0.9444784653236158]
We propose zeroth-order (ZO) on-device learning (ODL) methods for full-precision and 8-bit quantized deep neural networks (DNNs). ElasticZO achieves 5.2-9.5% higher accuracy with only 0.072-1.7% memory overhead and can handle fine-tuning tasks as well as full training. ElasticZO-INT8 achieves integer-arithmetic-only ZO-based training for the first time by incorporating a novel method for computing quantized ZO gradients from integer cross-entropy loss values.
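The ZO building block is the classic two-point gradient estimate, sketched below; ElasticZO's mixing of this with first-order updates for the final layers (and the integer-only INT8 variant) is not shown.

```python
# Two-point zeroth-order gradient step: perturb along a random direction,
# difference the losses, and descend along that direction.
import torch

def zo_step(params, loss_fn, eps: float = 1e-3, lr: float = 1e-2):
    u = [torch.randn_like(p) for p in params]         # shared random direction
    with torch.no_grad():
        for p, d in zip(params, u): p += eps * d
        loss_plus = loss_fn()
        for p, d in zip(params, u): p -= 2 * eps * d
        loss_minus = loss_fn()
        for p, d in zip(params, u): p += eps * d      # restore parameters
        g = (loss_plus - loss_minus) / (2 * eps)      # directional derivative
        for p, d in zip(params, u): p -= lr * g * d   # SGD step along u

w = torch.randn(10)
zo_step([w], lambda: (w ** 2).sum())                  # nudges w toward zero
```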
arXiv Detail & Related papers (2025-01-08T05:25:14Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce the Binarized Early Exit Transformer (BEExformer), the first selective-learning-based transformer to integrate Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, enabling gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on a fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
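For intuition, below is a straight-through binarizer whose backward pass uses a piecewise-quadratic sign surrogate (the Bi-Real-Net-style ApproxSign, used here as a stand-in; BEExformer's exact second-order approximation may differ), so gradients reflect both the sign and magnitude of the weights.

```python
# Binarization-aware training with a smooth sign surrogate in the backward
# pass (the surrogate choice is an assumption of this sketch).
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return w.sign()                       # hard binarization forward

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Derivative of a piecewise-quadratic surrogate: 2 - 2|w| on [-1, 1],
        # zero outside, so updates depend on both sign and magnitude.
        return grad_out * (2 - 2 * w.abs()).clamp_min(0.0)

w = torch.randn(8, requires_grad=True)
BinarizeSTE.apply(w).sum().backward()         # gradient flows via surrogate
```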
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D attention in the context of video prediction. Significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z) - Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
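The "model arithmetic" idea can be pictured as adding a small trained parameter offset to the deployed weights to compensate for a compression-degree mismatch, as in the hypothetical sketch below; how Token Compensator actually trains and indexes its offsets per degree pair is not captured here.

```python
# Hypothetical parameter arithmetic: theta' = theta + alpha * delta, applied
# tensor-by-tensor without any re-tuning of the base model.
import torch

def apply_offset(state: dict, delta: dict, alpha: float = 1.0) -> dict:
    return {k: v + alpha * delta[k] for k, v in state.items()}

state = {"w": torch.randn(4, 4)}            # deployed weights
delta = {"w": 0.01 * torch.randn(4, 4)}     # trained compensation offset
deployed = apply_offset(state, delta)
```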
arXiv Detail & Related papers (2024-08-13T10:36:43Z) - Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions [27.111140222002653]
This paper investigates the role of CLIP image embeddings within the Stable Video Diffusion (SVD) framework.
We introduce the VCUT, a training-free approach optimized for efficiency within the SVD architecture.
The implementation of VCUT leads to a reduction of up to 322T Multiply-Accumulate operations (MACs) per video and a decrease in model parameters of up to 50M, achieving a 20% reduction in latency compared to the baseline.
arXiv Detail & Related papers (2024-07-27T08:21:14Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
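The basic operation such a framework searches over is a low-rank factorization of a weight matrix; the sketch below uses a truncated SVD to split one linear layer into two thin ones (HEAT's hardware-aware selection of ranks and decomposition schemes per layer is not shown).

```python
# Truncated-SVD factorization of a linear layer: W ~ B @ A with rank r,
# shrinking in*out parameters to roughly (in + out) * r.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    A = nn.Linear(layer.in_features, rank, bias=False)
    B = nn.Linear(rank, layer.out_features)
    A.weight.data = S[:rank].sqrt()[:, None] * Vh[:rank]   # (rank, in)
    B.weight.data = U[:, :rank] * S[:rank].sqrt()          # (out, rank)
    B.bias.data = layer.bias.data.clone()
    return nn.Sequential(A, B)

approx = factorize_linear(nn.Linear(768, 768), rank=64)    # ~6x fewer weights
```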
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present Online Convolutional Re-parameterization (OREPA), a two-stage pipeline that reduces the heavy training overhead by squeezing the complex training-time block into a single convolution.
Compared with state-of-the-art re-parameterization models, OREPA reduces training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
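Structural re-parameterization itself is compact to demonstrate: convolution is linear in its kernel, so parallel branches collapse into a single conv by summing suitably padded kernels. The sketch below merges a 3x3 and a 1x1 branch; OREPA's contribution of keeping the block squeezed during training is not captured by this deploy-time merge.

```python
# Merging parallel conv branches (3x3 + 1x1) into one 3x3 conv by kernel
# addition; the merged layer is numerically equivalent to the branch sum.
import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_branches(conv3: nn.Conv2d, conv1: nn.Conv2d) -> nn.Conv2d:
    merged = nn.Conv2d(conv3.in_channels, conv3.out_channels, 3, padding=1)
    k1 = F.pad(conv1.weight.data, [1, 1, 1, 1])  # place 1x1 at 3x3 center
    merged.weight.data = conv3.weight.data + k1
    merged.bias.data = conv3.bias.data + conv1.bias.data
    return merged

c3, c1 = nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 1)
m = merge_branches(c3, c1)
x = torch.randn(2, 16, 8, 8)
assert torch.allclose(m(x), c3(x) + c1(x), atol=1e-5)
```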
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.