Understanding vision transformer robustness through the lens of out-of-distribution detection
- URL: http://arxiv.org/abs/2602.01459v1
- Date: Sun, 01 Feb 2026 22:00:59 GMT
- Title: Understanding vision transformer robustness through the lens of out-of-distribution detection
- Authors: Joey Kuang, Alexander Wong,
- Abstract summary: Quantization reduces memory and inference costs at the risk of performance loss. We investigate the behaviour of quantized small variants of popular vision transformers (DeiT, DeiT3, and ViT) on common out-of-distribution (OOD) datasets.
- Score: 59.72757235382676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have shown remarkable performance on vision tasks, but making them accessible and usable in real time remains challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low-precision issues, mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may offer insight into quantization behaviour when explored in out-of-distribution (OOD) situations. We investigate the behaviour of quantized small variants of popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses reveal the initial instabilities of 4-bit models, particularly those pretrained on the larger ImageNet-22k: the strongest FP32 model, DeiT3, drops sharply by 17% under quantization error, becoming one of the weakest 4-bit models. While ViT shows reasonable quantization robustness in ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k experienced average quantization deltas in AUPR-out of 15.0% and 19.2%, respectively, between full precision and 4-bit, while their ImageNet-1k-only counterparts experienced deltas of 9.5% and 12.0%. Overall, our results suggest that pretraining on large-scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.
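The abstract's headline numbers are AUPR-out deltas between a full-precision model and its 4-bit counterpart. The snippet below is a minimal sketch of how such a metric can be computed, assuming maximum-softmax-probability OOD scores and scikit-learn's average_precision_score; the score function, variable names, and the FP32/INT4 placeholders are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: AUPR-out and the quantization delta described in the abstract.
# Assumptions (not from the paper): max-softmax probability is the OOD score,
# and fp32_*/int4_* arrays stand in for the full-precision and 4-bit model outputs.
import numpy as np
from sklearn.metrics import average_precision_score


def ood_scores(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability; higher means 'more in-distribution'."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.max(axis=1)


def aupr_out(id_logits: np.ndarray, ood_logits: np.ndarray) -> float:
    """Area under the precision-recall curve with OOD as the positive class.

    Scores are negated so that larger values indicate 'more OOD'.
    """
    scores = -np.concatenate([ood_scores(id_logits), ood_scores(ood_logits)])
    labels = np.concatenate([np.zeros(len(id_logits)), np.ones(len(ood_logits))])
    return average_precision_score(labels, scores)


# Quantization delta, FP32 minus 4-bit, for one OOD dataset (placeholder inputs):
# delta = aupr_out(fp32_id_logits, fp32_ood_logits) - aupr_out(int4_id_logits, int4_ood_logits)
```

Under these assumptions, the reported 15.0% and 19.2% deltas would correspond to this FP32-minus-4-bit difference averaged over the OOD benchmarks.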
Related papers
- HEART-VIT: Hessian-Guided Efficient Dynamic Attention and Token Pruning in Vision Transformer [3.652580364273503]
We introduce HEART-ViT, a Hessian-guided efficient dynamic attention and token pruning framework for vision transformers. HEART-ViT estimates curvature-weighted sensitivities of both tokens and attention heads using efficient Hessian-vector products. On ImageNet-100 and ImageNet-1K with ViT-B/16 and DeiT-B/16, HEART-ViT achieves up to 49.4 percent FLOPs reduction, 36 percent lower latency, and 46 percent higher throughput.
arXiv Detail & Related papers (2025-12-23T07:23:16Z) - ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers [7.155242379236052]
Quantization of Vision Transformers (ViTs) has emerged as a promising solution to mitigate these challenges.
Existing methods still suffer from significant accuracy loss at low bit-widths.
ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit.
arXiv Detail & Related papers (2024-07-03T02:41:59Z) - On Calibration of Modern Quantized Efficient Neural Networks [79.06893963657335]
Quality of calibration is observed to track the quantization quality (a minimal calibration-error sketch appears after this list).
GhostNet-VGG is shown to be the most robust to overall performance drop at lower precision.
arXiv Detail & Related papers (2023-09-25T04:30:18Z) - QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection [57.019527599167255]
Multi-view 3D detection based on BEV (bird's-eye view) has recently achieved significant improvements.
We show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation.
Our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance.
arXiv Detail & Related papers (2023-08-21T07:06:49Z) - Pushing the Limits of Fewshot Anomaly Detection in Industry Vision: Graphcore [71.09522172098733]
We utilize graph representation in FSAD and provide a novel visual isometric invariant feature (VIIF) as an anomaly measurement feature.
VIIF can robustly improve the anomaly-discriminating ability and can further reduce the size of redundant features stored in the memory bank M.
Besides, we provide a novel model, GraphCore, built on VIIFs that can rapidly implement unsupervised FSAD training and improve anomaly detection performance.
arXiv Detail & Related papers (2023-01-28T03:58:32Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Sharpness-aware Quantization for Deep Neural Networks [45.150346855368]
Sharpness-Aware Quantization (SAQ) is a novel method to explore the effect of Sharpness-Aware Minimization (SAM) on model compression.
We show that SAQ improves the generalization performance of the quantized models, yielding SOTA results in uniform quantization.
arXiv Detail & Related papers (2021-11-24T05:16:41Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
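As noted in the calibration entry above ("On Calibration of Modern Quantized Efficient Neural Networks") and in the main paper's ID calibration analysis, calibration quality is typically measured as the gap between predicted confidence and accuracy. Below is a minimal sketch of the standard expected calibration error (ECE) with equal-width bins; the 15-bin default and max-softmax confidences are common conventions assumed here, not details taken from either paper.

```python
# Hedged sketch: expected calibration error (ECE) with equal-width confidence bins.
# Assumption: confidences are max-softmax probabilities; 15 bins is a common but arbitrary choice.
import numpy as np


def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 15) -> float:
    """Weighted average gap between mean confidence and accuracy across bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in this bin
    return ece


# Example usage with placeholder arrays (not from the papers):
# conf = probs.max(axis=1)
# correct = (probs.argmax(axis=1) == labels).astype(float)
# print(expected_calibration_error(conf, correct))
```

A quantized model whose confidences outrun its accuracy yields a larger ECE, which is the sense in which calibration quality can "track" quantization quality.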
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.