End-to-End Supermask Pruning: Learning to Prune Image Captioning Models
- URL: http://arxiv.org/abs/2110.03298v1
- Date: Thu, 7 Oct 2021 09:34:00 GMT
- Title: End-to-End Supermask Pruning: Learning to Prune Image Captioning Models
- Authors: Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah
- Abstract summary: We show that an 80% to 95% sparse network can either match or outperform its dense counterpart.
Pre-trained Up-Down and Object Relation Transformer models achieve CIDEr scores >120 on the MS-COCO dataset.
- Score: 17.00974730372399
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the advancement of deep models, research work on image captioning has
led to a remarkable gain in raw performance over the last decade, along with
increasing model complexity and computational cost. Surprisingly, however, work
on compressing deep networks for the image captioning task has received little
to no attention. For the first time in image captioning research, we provide an
extensive comparison of various unstructured weight pruning methods on three
different popular image captioning architectures, namely Soft-Attention,
Up-Down and Object Relation Transformer. Following this, we propose a novel
end-to-end weight pruning method that performs gradual sparsification based on
weight sensitivity to the training loss. The pruning schemes are then extended
with encoder pruning, where we show that conducting both decoder pruning and
training simultaneously prior to the encoder pruning provides good overall
performance. Empirically, we show that an 80% to 95% sparse network (up to 75%
reduction in model size) can either match or outperform its dense counterpart.
The code and pre-trained models for Up-Down and Object Relation Transformer,
which achieve CIDEr scores >120 on the MS-COCO dataset with model sizes of
only 8.7 MB and 14.5 MB (size reductions of 96% and 94% respectively against
the dense versions), are publicly available at
https://github.com/jiahuei/sparse-image-captioning.
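
The gradual sparsification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cubic ramp in `target_sparsity` is an assumed schedule, and weight magnitude stands in as the importance score, whereas the paper's method decides which weights to keep from their sensitivity to the training loss. All function names are hypothetical.

```python
import random


def target_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Ramp sparsity from initial to final with a cubic curve (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-|w| weights.

    Magnitude is only a stand-in importance score for this sketch.
    """
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from least to most important under the stand-in score.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Zero out the pruned weights; kept weights pass through unchanged."""
    return [w * m for w, m in zip(weights, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    for step in (0, 250, 500, 750, 1000):
        s = target_sparsity(step, 1000, final_sparsity=0.95)
        pruned = apply_mask(weights, magnitude_mask(weights, s))
        zeros = sum(1 for w in pruned if w == 0.0)
        print(f"step {step:4d}: target sparsity {s:.3f}, zero weights {zeros}")
```

In a real training loop the mask would be recomputed (or learned) while the remaining weights continue to train, so the network can adapt as sparsity rises toward the 80% to 95% range reported above.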
 
      
Related papers
- Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
 We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning.
We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks.
Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
 arXiv  Detail & Related papers  (2025-04-17T17:59:57Z)
- Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion [3.399289369740637]
 This paper presents a pioneering study on post-training pruning of Stable Diffusion 2.
It addresses the critical need for model compression in the text-to-image domain.
We propose an optimal pruning configuration that prunes the text encoder to 47.5% and the diffusion generator to 35%.
 arXiv  Detail & Related papers  (2024-11-22T18:29:37Z)
- LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation [2.0901574458380403]
 We propose a new lightweight but efficient model, namely LiteNeXt, for medical image segmentation.
LiteNeXt is trained from scratch with a small number of parameters (0.71M) and a low compute cost (0.42 GFLOPs).
 arXiv  Detail & Related papers  (2024-04-04T01:59:19Z)
- Reducing The Amortization Gap of Entropy Bottleneck In End-to-End Image Compression [2.1485350418225244]
 End-to-end deep trainable models are about to exceed the performance of the traditional handcrafted compression techniques on videos and images.
We propose a simple yet efficient instance-based parameterization method to reduce this amortization gap at a minor cost.
 arXiv  Detail & Related papers  (2022-09-02T11:43:45Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
 We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
 arXiv  Detail & Related papers  (2022-08-12T16:48:10Z)
- CrAM: A Compression-Aware Minimizer [103.29159003723815]
 We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
 arXiv  Detail & Related papers  (2022-07-28T16:13:28Z)
- Estimating the Resize Parameter in End-to-end Learned Image Compression [50.20567320015102]
 We describe a search-free resizing framework that can further improve the rate-distortion tradeoff of recent learned image compression models.
Our results show that our new resizing parameter estimation framework can provide Bjontegaard-Delta rate (BD-rate) improvement of about 10% against leading perceptual quality engines.
 arXiv  Detail & Related papers  (2022-04-26T01:35:02Z)
- Structured Pruning is All You Need for Pruning CNNs at Initialization [38.88730369884401]
 Pruning is a popular technique for reducing the model size and computational cost of convolutional neural networks (CNNs).
We propose PreCropping, a structured hardware-efficient model compression scheme.
Compared to weight pruning, the proposed scheme is regular and dense in both storage and computation without sacrificing accuracy.
 arXiv  Detail & Related papers  (2022-03-04T19:54:31Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
 Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
 Coupling these two designs enables us to train large models efficiently and effectively.
 arXiv  Detail & Related papers  (2021-11-11T18:46:40Z)
- Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
 We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
 arXiv  Detail & Related papers  (2021-06-18T01:03:13Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
 We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
 arXiv  Detail & Related papers  (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     