COMCAT: Towards Efficient Compression and Customization of
Attention-Based Vision Models
- URL: http://arxiv.org/abs/2305.17235v2
- Date: Fri, 9 Jun 2023 16:11:21 GMT
- Title: COMCAT: Towards Efficient Compression and Customization of
Attention-Based Vision Models
- Authors: Jinqi Xiao, Miao Yin, Yu Gong, Xiao Zang, Jian Ren, Bo Yuan
- Abstract summary: This paper explores an efficient method for compressing vision transformers to enrich the toolset for obtaining compact attention-based vision models.
For compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach achieves 0.45% and 0.76% higher top-1 accuracy than state-of-the-art pruning methods, even with fewer parameters.
- Score: 21.07857091998763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based vision models, such as Vision Transformer (ViT) and its
variants, have shown promising performance in various computer vision tasks.
However, these emerging architectures suffer from large model sizes and high
computational costs, calling for efficient model compression solutions. To
date, pruning ViTs has been well studied, while other compression strategies
that have been widely applied in CNN compression, e.g., model factorization, are
little explored in the context of ViT compression. This paper explores an
efficient method for compressing vision transformers to enrich the toolset for
obtaining compact attention-based vision models. Based on a new insight into
the multi-head attention layer, we develop a highly efficient ViT compression
solution, which outperforms the state-of-the-art pruning methods. For
compressing DeiT-small and DeiT-base models on ImageNet, our proposed approach
achieves 0.45% and 0.76% higher top-1 accuracy than these pruning methods, even
with fewer parameters.
Our finding can also be applied to improve the customization efficiency of
text-to-image diffusion models, with much faster training (up to $2.6\times$
speedup) and lower extra storage cost (up to $1927.5\times$ reduction) than the
existing works.
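As a concrete illustration of the factorization idea, here is a minimal sketch in PyTorch that compresses a single attention projection by truncated SVD. This is a generic low-rank factorization, not the exact COMCAT algorithm; the rank, layer sizes, and names are illustrative assumptions:

```python
# Minimal low-rank factorization sketch for one ViT attention projection.
# Illustrates generic "model factorization" compression, NOT the exact
# COMCAT method; rank and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense Linear with two smaller Linears via rank-r truncated SVD."""
    W = layer.weight.data                          # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data.copy_(Vh[:rank, :])           # (rank, in_features)
    up.weight.data.copy_(U[:, :rank] * S[:rank])   # (out_features, rank)
    if layer.bias is not None:
        up.bias.data.copy_(layer.bias.data)
    return nn.Sequential(down, up)

proj = nn.Linear(384, 384)               # DeiT-small hidden size is 384
small = factorize_linear(proj, rank=64)  # 147,456 weights -> 49,152 weights
x = torch.randn(1, 197, 384)             # (batch, tokens, dim); 197 tokens at 224px
err = (proj(x) - small(x)).abs().max().item()
print(f"max abs reconstruction error: {err:.4f}")
```

The same low-rank view also suggests why cheap customization is possible: fine-tuning and storing only small factors of this kind, rather than full weight copies, is the general mechanism by which factorization-style methods cut per-user storage for diffusion-model customization.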
Related papers
- Dense Vision Transformer Compression with Few Samples [20.45895466934069]
Few-shot model compression aims to compress a large model into a more compact one with only a tiny training set (even without labels).
This paper proposes a novel framework for few-shot ViT compression named DC-ViT.
arXiv Detail & Related papers (2024-03-27T15:56:42Z)
- Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression [60.689646881479064]
We propose a progressive learning paradigm for transformer-based variable-rate image compression.
Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively.
Our model outperforms all current variable-rate image compression methods in terms of rate-distortion performance and approaches the state-of-the-art fixed-rate image compression methods trained from scratch.
arXiv Detail & Related papers (2023-11-23T08:29:32Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer [76.2625311630021]
Vision transformers (ViTs) have shown very impressive empirical performance in various computer vision tasks, but their large model sizes and high computational costs make deployment challenging.
To mitigate this problem, structured pruning is a promising solution to compress model size and enable practical efficiency.
We propose GOHSP, a unified framework of Graph and Optimization-based Structured Pruning for ViT models.
arXiv Detail & Related papers (2023-01-13T00:40:24Z)
- Estimating the Resize Parameter in End-to-end Learned Image Compression [50.20567320015102]
We describe a search-free resizing framework that can further improve the rate-distortion tradeoff of recent learned image compression models.
Our results show that our new resizing parameter estimation framework can provide Bjontegaard-Delta rate (BD-rate) improvement of about 10% against leading perceptual quality engines.
arXiv Detail & Related papers (2022-04-26T01:35:02Z)
- ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding [9.908820641439368]
We propose an efficient model, ELIC, to achieve state-of-the-art speed and compression ability.
With superior performance, the proposed model also supports extremely fast preview decoding and progressive decoding.
arXiv Detail & Related papers (2022-03-21T11:19:50Z)
- Unified Visual Transformer Compression [102.26265546836329]
This paper proposes a unified ViT compression framework that seamlessly assembles three effective techniques: pruning, layer skipping, and knowledge distillation.
We formulate a budget-constrained, end-to-end optimization framework that jointly learns model weights, layer-wise pruning ratios/masks, and skip configurations; a minimal gate-based sketch of the mask idea appears after this list.
Experiments are conducted with several ViT variants, e.g. DeiT and T2T-ViT backbones on the ImageNet dataset, and our approach consistently outperforms recent competitors.
arXiv Detail & Related papers (2022-03-15T20:38:22Z)
- Multi-Dimensional Model Compression of Vision Transformer [21.8311401851523]
Vision transformers (ViT) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment.
Previous ViT pruning methods tend to prune the model along only one dimension.
We advocate a multi-dimensional ViT compression paradigm and propose to reduce redundancy jointly across the attention-head, neuron, and sequence dimensions.
arXiv Detail & Related papers (2021-12-31T19:54:18Z)
- Learned Image Compression for Machine Perception [17.40776913809306]
We develop a framework that produces a compression format suitable for both human perception and machine perception.
We show that representations can be learned that simultaneously optimize for compression and performance on core vision tasks.
arXiv Detail & Related papers (2021-11-03T14:39:09Z)
- Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform [58.60004238261117]
We propose a versatile deep image compression network based on Spatial Feature Transform (SFT, arXiv:1804.02815).
Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps.
The proposed framework allows us to perform task-aware image compressions for various tasks.
arXiv Detail & Related papers (2021-08-21T17:30:06Z)
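For the mask-learning idea in the Unified Visual Transformer Compression entry, here is a minimal sketch, assuming simple learnable per-head gates trained with an L1 budget penalty; the paper's actual budget-constrained formulation is more elaborate, and all names and hyperparameters here are illustrative:

```python
# Minimal sketch of learnable attention-head gates with a budget penalty.
# Loosely illustrates joint mask learning for structured ViT pruning;
# NOT the paper's exact formulation. Names/hyperparameters are illustrative.
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Scales each attention head's output by a learnable gate."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, heads, tokens, head_dim)
        return head_outputs * self.gates.view(1, -1, 1, 1)

    def budget_penalty(self) -> torch.Tensor:
        # L1 penalty drives unimportant gates toward zero during training.
        return self.gates.abs().sum()

gated = GatedHeads(num_heads=6)              # DeiT-small uses 6 heads
h = torch.randn(2, 6, 197, 64)               # dummy per-head outputs
task_loss = gated(h).pow(2).mean()           # stand-in for the real task loss
loss = task_loss + 1e-3 * gated.budget_penalty()
loss.backward()                              # gates and weights get gradients jointly
# After training, heads whose |gate| falls below a threshold are pruned away.
```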