Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation
- URL: http://arxiv.org/abs/2602.18861v1
- Date: Sat, 21 Feb 2026 15:02:21 GMT
- Title: Joint Post-Training Quantization of Vision Transformers with Learned Prompt-Guided Data Generation
- Authors: Shile Li, Markus Karmann, Onay Urfalioglu
- Abstract summary: We present a framework for joint quantization of Vision Transformers trained on ImageNet. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet. We also introduce a data-free calibration strategy that synthesizes diverse, label-free samples.
- Score: 5.75627633588113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a framework for end-to-end joint quantization of Vision Transformers trained on ImageNet for the purpose of image classification. Unlike prior post-training or block-wise reconstruction methods, we jointly optimize over the entire set of all layers and inter-block dependencies without any labeled data, scaling effectively with the number of samples and completing in just one hour on a single GPU for ViT-small. We achieve state-of-the-art W4A4 and W3A3 accuracies on ImageNet and, to the best of our knowledge, the first PTQ results that maintain strong accuracy on ViT, DeiT, and Swin-T models under extremely low-bit settings (W1.58A8), demonstrating the potential for efficient edge deployment. Furthermore, we introduce a data-free calibration strategy that synthesizes diverse, label-free samples using Stable Diffusion Turbo guided by learned multi-mode prompts. By encouraging diversity in both the learned prompt embeddings and the generated image features, our data-free approach achieves performance on par with real-data ImageNet calibration and surpasses simple text-prompt baselines such as "a <adjective> photo of <adjective> <cls>".
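The abstract compresses two mechanisms into a few sentences, so the sketch below illustrates the generic building blocks a reader can picture behind them: symmetric uniform fake quantization with learnable scales for the W4A4 setting, and a diversity penalty over learned prompt embeddings for the data-free calibration. This is a minimal sketch under assumed per-tensor granularity; the names (`FakeQuantLinear`, `prompt_diversity_loss`), the straight-through rounding, and the loss form are illustrative choices, not details confirmed by the paper.
```python
# Minimal, illustrative sketch (not the paper's API): (1) symmetric uniform fake
# quantization (e.g. W4A4) of a ViT linear layer with learnable scales, and
# (2) a diversity penalty over learned prompt embeddings for data-free calibration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ste_round(x: torch.Tensor) -> torch.Tensor:
    # Straight-through estimator: round in the forward pass, identity gradient in
    # the backward pass, so quantizer scales can be optimized end to end.
    return (x.round() - x).detach() + x


def fake_quantize(x: torch.Tensor, scale: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize-dequantize x with a symmetric uniform quantizer."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    q = torch.clamp(ste_round(x / scale), -qmax - 1, qmax)
    return q * scale                                # back to float for simulation


class FakeQuantLinear(nn.Module):
    """Wraps an nn.Linear with weight (W4) and activation (A4) fake quantizers."""

    def __init__(self, linear: nn.Linear, w_bits: int = 4, a_bits: int = 4):
        super().__init__()
        self.linear, self.w_bits, self.a_bits = linear, w_bits, a_bits
        w_max = linear.weight.detach().abs().max()
        # Learnable scales: a joint PTQ scheme can optimize them across all layers
        # at once against the full-precision outputs on (synthetic) calibration data.
        self.w_scale = nn.Parameter(w_max / (2 ** (w_bits - 1) - 1))
        self.a_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_q = fake_quantize(x, self.a_scale, self.a_bits)
        w_q = fake_quantize(self.linear.weight, self.w_scale, self.w_bits)
        return F.linear(x_q, w_q, self.linear.bias)


def prompt_diversity_loss(prompts: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity between M learned prompt embeddings
    (shape (M, D), illustrative) so the prompt modes steer the generator toward
    visually distinct calibration samples rather than collapsing to one mode."""
    p = F.normalize(prompts, dim=-1)
    sim = p @ p.t()                                  # (M, M) cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    return off_diag.pow(2).mean()
```
In a joint setup of this kind, the learnable scales of every wrapped layer would typically be optimized together to match the full-precision model's outputs on the real or synthesized calibration set, which is what distinguishes it from block-wise reconstruction; the exact objective used by the paper may differ.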
Related papers
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers [58.80845404416028]
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. We propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs.
arXiv Detail & Related papers (2024-12-21T09:30:45Z)
- Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation [67.37146712877794]
IT3A is a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. In a zero-shot setting, IT3A outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.
arXiv Detail & Related papers (2024-12-12T20:01:24Z)
- Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers [5.359378066251386]
Self-supervised learning with vision transformers (ViTs) has proven effective for representation learning.
Existing ViT-based SSL architectures do not fully exploit the ViT backbone.
We introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively.
arXiv Detail & Related papers (2024-06-18T06:36:44Z)
- Leveraging Representations from Intermediate Encoder-blocks for Synthetic Image Detection [13.840950434728533]
State-of-the-art Synthetic Image Detection (SID) research has produced strong evidence for the advantages of extracting features from foundation models.
We leverage the image representations extracted by intermediate Transformer blocks of CLIP's image-encoder via a lightweight network.
Our method is compared against the state-of-the-art by evaluating it on 20 test datasets and exhibits an average +10.6% absolute performance improvement.
arXiv Detail & Related papers (2024-02-29T12:18:43Z)
- Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of CLIP's vision-language approach does not effectively account for the noisy many-to-many correspondences found in web-harvested image-captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyperparameter tuning.
We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
- LiT: Zero-Shot Transfer with Locked-image Text Tuning [68.78877201319811]
"Locked-image Text tuning" (LiT-tuning) teaches a text model to read out good representations from a pre-trained image model for new tasks.
A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval.
arXiv Detail & Related papers (2021-11-15T18:53:48Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by allowing depth, width, resolution, and patch size to be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)