Utilizing dynamic sparsity on pretrained DETR
- URL: http://arxiv.org/abs/2510.09380v1
- Date: Fri, 10 Oct 2025 13:36:00 GMT
- Title: Utilizing dynamic sparsity on pretrained DETR
- Authors: Reza Sedghi, Anand Subramoney, David Kappel
- Abstract summary: We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns; because sparsity is input-dependent, SIBS offers only limited gains. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR.
- Score: 0.9332987715848716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves up to 85-95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
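The gating idea is simple enough to sketch. Below is a minimal PyTorch illustration of an MGS-style micro-gate, assuming a DETR-like FFN block: a single small linear layer predicts a per-token binary mask over the hidden units, trained with a straight-through estimator on top of frozen pretrained weights. Layer names, dimensions, and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MicroGatedFFN(nn.Module):
    """Pretrained FFN block wrapped with an MGS-style micro-gate (sketch)."""

    def __init__(self, d_model: int = 256, d_hidden: int = 2048):
        super().__init__()
        # In practice fc1/fc2 would be the frozen weights of a pretrained DETR FFN.
        self.fc1 = nn.Linear(d_model, d_hidden).requires_grad_(False)
        self.fc2 = nn.Linear(d_hidden, d_model).requires_grad_(False)
        # The "micro" gate: one small linear layer, the only trainable part.
        self.gate = nn.Linear(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.gate(x))
        # Hard 0/1 mask in the forward pass, straight-through gradient in backward.
        mask = (soft > 0.5).float() + soft - soft.detach()
        h = torch.relu(self.fc1(x)) * mask
        # In a real deployment the masked-out hidden units would be skipped,
        # not computed and zeroed as in this dense illustration.
        return self.fc2(h)

x = torch.randn(2, 100, 256)        # (batch, object queries, d_model)
print(MicroGatedFFN()(x).shape)     # torch.Size([2, 100, 256])
```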
Related papers
- Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity [21.090365337326414]
Finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. We propose a scheme for effective finetuning via sparsification using trainable gates, which requires minimal trainable parameters. Empirical results show it outperforms recent finetuning baselines in efficiency and performance.
arXiv Detail & Related papers (2026-02-09T20:20:29Z)
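The gate mechanism lends itself to a short sketch; the granularity (one gate per output neuron) and the L1 regulariser below are assumptions for illustration, not the paper's exact scheme:

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Frozen pretrained layer plus a tiny trainable gate vector (sketch)."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)                 # frozen pretrained weights
        self.g = nn.Parameter(torch.ones(base.out_features))   # one gate per output neuron

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) * self.g                           # gate scales each output unit

layer = GatedLinear(nn.Linear(512, 512))
# Task loss plus an L1 term that pushes gates to zero; rows whose gate reaches
# zero can be removed entirely at export time, compressing the model.
loss = layer(torch.randn(4, 512)).pow(2).mean() + 1e-3 * layer.g.abs().sum()
loss.backward()                                                # only the 512 gates get gradients
```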
- Sparse Attention Post-Training for Mechanistic Interpretability [55.030850996535776]
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to approximately 0.3% of its edges.
arXiv Detail & Related papers (2025-12-05T16:40:08Z)
- R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference [77.47238561728459]
R-Sparse is a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity.
arXiv Detail & Related papers (2025-04-28T03:30:32Z)
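As a flavour of what training-free activation sparsity means mechanically, here is a toy top-k thresholding pass; the keep ratio is an arbitrary assumption, and R-Sparse's actual rank-aware selection is more involved than this:

```python
import torch

def topk_sparsify(h: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Keep only the top-k largest-magnitude activations per token (sketch)."""
    k = max(1, int(h.shape[-1] * keep))
    idx = h.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(h).scatter_(-1, idx, 1.0)
    return h * mask   # zeroed units let the next matmul skip their rows

h = torch.randn(2, 10, 2048)                    # (batch, tokens, hidden)
print((topk_sparsify(h) == 0).float().mean())   # ~0.5 of activations are zero
```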
- The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model [2.355460994057843]
Self-distillation (SD) has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear.
arXiv Detail & Related papers (2025-01-27T17:20:48Z)
- Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models [14.68920095399595]
Sparsity-based PEFT (SPEFT) introduces trainable sparse adaptations to the weight matrices in the model. We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies. We compare static and dynamic masking strategies, finding that static masking, which predetermines non-zero entries before training, delivers efficiency without sacrificing performance.
arXiv Detail & Related papers (2024-12-18T04:14:35Z)
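A minimal sketch of the static-masking recipe: a salience metric computed once before training (plain weight magnitude below, standing in for the zero-cost proxies the paper evaluates) fixes which entries of a sparse update are trainable, and the mask never changes afterwards. Metric and density are illustrative assumptions.

```python
import torch
import torch.nn as nn

def static_mask(weight: torch.Tensor, density: float = 0.01) -> torch.Tensor:
    """Pick the top `density` fraction of entries by salience (here: magnitude)."""
    k = max(1, int(weight.numel() * density))
    thresh = weight.abs().flatten().topk(k).values[-1]
    return (weight.abs() >= thresh).float()

base = nn.Linear(512, 512).requires_grad_(False)     # frozen pretrained layer
mask = static_mask(base.weight)                      # fixed before training starts
delta = nn.Parameter(torch.zeros_like(base.weight))  # sparse trainable adaptation

x = torch.randn(4, 512)
out = x @ (base.weight + delta * mask).T + base.bias
out.sum().backward()    # gradients reach only the ~1% of delta the mask selects
```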
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
- A Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness [29.87592869483743]
A recent empirical observation of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free. We propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it.
arXiv Detail & Related papers (2023-09-06T13:48:40Z)
- Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor. We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z)
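The mechanism reduces to primal-dual updates, shown in the toy loop below; the gate parameterisation, stand-in loss, and step sizes are illustrative assumptions, not the paper's formulation. The multiplier rises while the constraint is violated and settles once the target sparsity is met, so no penalty factor is hand-tuned:

```python
import torch

torch.manual_seed(0)
gates = torch.rand(1000, requires_grad=True)   # relaxed gates in [0, 1], one per unit
lmbda = torch.tensor(0.0)                      # Lagrange multiplier (dual variable)
target_density = 0.10                          # desired fraction of active units

for step in range(300):
    task_loss = (gates - 0.5).pow(2).mean()    # stand-in for the real training loss
    violation = gates.mean() - target_density  # > 0 while the model is too dense
    lagrangian = task_loss + lmbda * violation
    if gates.grad is not None:
        gates.grad = None
    lagrangian.backward()
    with torch.no_grad():
        gates -= 100.0 * gates.grad            # primal descent (lr absorbs the 1/N of .mean())
        gates.clamp_(0.0, 1.0)
        lmbda += 0.05 * violation.detach()     # dual ascent: the multiplier sets itself
        lmbda.clamp_(min=0.0)

print(f"final density: {gates.mean().item():.3f}")   # ~= 0.100, no penalty tuning
```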
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices. The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S). Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
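A toy version of the dynamic mask exploration such frameworks perform: mutate a fixed-budget mask by pruning the smallest surviving weights and regrowing the same number of dead positions, so the non-zero count, and hence memory, never grows. The magnitude criterion and random regrowth below are assumptions, not MEST's exact Elastic Mutation rules:

```python
import torch

def mutate_mask(weight: torch.Tensor, mask: torch.Tensor, rate: float = 0.1):
    """Prune-and-regrow mutation that preserves the non-zero budget (sketch)."""
    m = mask.view(-1)
    n = int(m.sum().item() * rate)
    # Prune: deactivate the n smallest-magnitude weights that are still alive.
    scores = torch.where(m.bool(), weight.view(-1).abs(), torch.tensor(float("inf")))
    m[scores.topk(n, largest=False).indices] = 0.0
    # Regrow: reactivate n currently dead positions, chosen at random here.
    dead = (m == 0).nonzero().squeeze(1)
    m[dead[torch.randperm(dead.numel())[:n]]] = 1.0
    return mask

w = torch.randn(256, 256)
mask = (torch.rand_like(w) < 0.2).float()   # 20% dense mask
before = mask.sum()
mutate_mask(w, mask)
assert mask.sum() == before                 # same memory footprint after mutation
```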
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models. Models trained in this manner exhibit similar performance, but have a distribution with markedly higher density at zero, allowing more parameters to be pruned safely. Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
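The reparameterisation itself is one line: effective weights are w = phi * |phi|**(alpha - 1) with alpha > 1, so gradient updates on phi are implicitly scaled by parameter magnitude and small weights stay near zero, making them safe to prune. A sketch, with the layer shape and alpha value as arbitrary choices:

```python
import torch
import torch.nn as nn

class PowerpropLinear(nn.Module):
    """Linear layer with Powerpropagation-style reparameterised weights (sketch)."""

    def __init__(self, d_in: int, d_out: int, alpha: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.phi = nn.Parameter(torch.randn(d_out, d_in) * 0.05)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sign-preserving power: w = phi * |phi|^(alpha - 1).
        w = self.phi * self.phi.abs().pow(self.alpha - 1.0)
        return x @ w.T

layer = PowerpropLinear(64, 64)
y = layer(torch.randn(8, 64))
# The chain rule gives d(loss)/d(phi) an extra |phi|^(alpha-1)-scaled factor,
# so large weights learn faster and small ones stay near zero ("rich get richer").
y.pow(2).mean().backward()
```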
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning. Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch. ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
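A hedged sketch of the weighting step: per-sample losses become mini-batch importance weights through a softmax with temperature lam, and the weighted loss feeds ordinary momentum SGD; a positive lam emphasises hard examples (imbalance), while a negative lam would down-weight high-loss, possibly noisy samples. Model, data, and lam below are placeholders, not the paper's experimental setup.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss(reduction="none")      # keep per-sample losses
lam = 1.0                                              # temperature (illustrative)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
losses = criterion(model(x), y)                        # shape (32,)
weights = torch.softmax(losses.detach() / lam, dim=0)  # attentional importance weights
loss = (weights * losses).sum()                        # weighted mini-batch loss
opt.zero_grad()
loss.backward()
opt.step()                                             # otherwise plain momentum SGD
```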