PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for
Vision Transformers
- URL: http://arxiv.org/abs/2209.05687v2
- Date: Mon, 31 Jul 2023 03:14:23 GMT
- Title: PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for
Vision Transformers
- Authors: Zhikai Li, Mengjuan Chen, Junrui Xiao, and Qingyi Gu
- Abstract summary: Data-free quantization can potentially address data privacy and security concerns in model compression.
Recently, PSAQ-ViT designed a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs).
In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs.
- Score: 2.954890575035673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-free quantization can potentially address data privacy and security
concerns in model compression, and thus has been widely investigated. Recently,
PSAQ-ViT designed a relative value metric, patch similarity, to generate data
from pre-trained vision transformers (ViTs), marking the first attempt at
data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more
accurate and general data-free quantization framework for ViTs, built on top of
PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT,
we introduce an adaptive teacher-student strategy, which facilitates the
constant cyclic evolution of the generated samples and the quantized model
(student) in a competitive and interactive fashion under the supervision of the
full-precision model (teacher), thus significantly improving the accuracy of
the quantized model. Moreover, without auxiliary category guidance, we
employ task- and model-independent prior information, making the
general-purpose scheme compatible with a broad range of vision tasks and
models. Extensive experiments are conducted on various models for image
classification, object detection, and semantic segmentation tasks, and PSAQ-ViT
V2, with the naive quantization strategy and without access to real-world data,
consistently achieves competitive results, showing potential as a powerful
baseline for data-free quantization of ViTs. For instance, with Swin-S as the
(backbone) model, 8-bit quantization reaches 82.13 top-1 accuracy on ImageNet,
50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that
accurate and general PSAQ-ViT V2 can serve as a potential and practical solution
in real-world applications involving sensitive data. Code is released and
merged at: https://github.com/zkkli/PSAQ-ViT.
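To make the adaptive teacher-student strategy concrete, the following Python (PyTorch-style) sketch illustrates one plausible reading of the cycle: synthetic samples are optimized to look "real" under a patch-similarity criterion computed from the full-precision teacher while disagreeing with the current quantized student, and the student is then distilled from the teacher on those refreshed samples. This is a minimal sketch, not the released implementation; the model interface (returning logits and patch tokens), the entropy surrogate, the loss terms, and all hyperparameters are assumptions made for illustration only.

# Minimal sketch of the adaptive teacher-student cycle (illustrative only).
# Assumed interface: model(x) -> (logits, patch_tokens). Names, loss terms,
# and hyperparameters are NOT taken from the released PSAQ-ViT code.
import torch
import torch.nn.functional as F

def patch_similarity_entropy(patch_tokens):
    # Crude surrogate of the patch-similarity metric: cosine similarity
    # between patch tokens, summarized by the entropy of a softmax over the
    # similarity values (the paper instead uses a kernel density estimate).
    x = F.normalize(patch_tokens, dim=-1)   # (B, N, D)
    sim = x @ x.transpose(-2, -1)           # (B, N, N) cosine similarities
    p = F.softmax(sim.flatten(1), dim=-1)   # pseudo-distribution per sample
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

def teacher_student_cycle(teacher, student, steps=500, student_steps=4):
    # Alternate sample evolution and student calibration; the frozen
    # full-precision teacher supervises both phases.
    teacher.requires_grad_(False).eval()
    images = torch.randn(32, 3, 224, 224, requires_grad=True)
    opt_img = torch.optim.Adam([images], lr=0.25)
    opt_stu = torch.optim.Adam(student.parameters(), lr=1e-4)

    for _ in range(steps):
        # (1) Evolve samples: high patch-similarity entropy under the teacher
        #     (look "real") while maximizing disagreement with the current
        #     student, so generation and quantization compete.
        t_logits, t_tokens = teacher(images)
        s_logits, _ = student(images)
        disagreement = F.kl_div(F.log_softmax(s_logits, dim=-1),
                                F.softmax(t_logits, dim=-1),
                                reduction="batchmean")
        gen_loss = -patch_similarity_entropy(t_tokens) - disagreement
        opt_img.zero_grad()
        gen_loss.backward()
        opt_img.step()

        # (2) Update the quantized student to mimic the teacher on the
        #     refreshed samples (a standard knowledge-distillation step).
        for _ in range(student_steps):
            with torch.no_grad():
                t_logits, _ = teacher(images.detach())
            s_logits, _ = student(images.detach())
            kd_loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                               F.softmax(t_logits, dim=-1),
                               reduction="batchmean")
            opt_stu.zero_grad()
            kd_loss.backward()
            opt_stu.step()

Because the generation objective here uses only model-level signals (patch similarity and teacher-student disagreement) rather than category labels, it reflects the abstract's claim that the scheme stays task- and model-independent and can therefore be reused beyond classification.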
Related papers
- CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs [6.456189487006878]
We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs).
We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships.
CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization.
arXiv Detail & Related papers (2024-07-07T05:39:25Z) - MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer [7.041718444626999]
We propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT).
Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset.
arXiv Detail & Related papers (2024-01-26T14:25:15Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design [84.34416126115732]
Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration.
We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers.
Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute.
arXiv Detail & Related papers (2023-05-22T13:39:28Z) - ViTPose++: Vision Transformer for Generic Body Pose Estimation [70.86760562151163]
We show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects.
ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints.
We empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token.
arXiv Detail & Related papers (2022-12-07T12:33:28Z) - Patch Similarity Aware Data-Free Quantization for Vision Transformers [2.954890575035673]
We propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for Vision Transformers.
We analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images.
Experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT.
arXiv Detail & Related papers (2022-03-04T11:47:20Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Vector-quantized Image Modeling with Improved VQGAN [93.8443646643864]
We propose a Vector-quantized Image Modeling approach that involves pretraining a Transformer to predict image tokens autoregressively.
We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity.
When trained on ImageNet at 256x256 resolution, we achieve Inception Score (IS) of 175.1 and Frechet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN.
arXiv Detail & Related papers (2021-10-09T18:36:00Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.