ACORT: A Compact Object Relation Transformer for Parameter Efficient
Image Captioning
- URL: http://arxiv.org/abs/2202.05451v1
- Date: Fri, 11 Feb 2022 05:10:28 GMT
- Title: ACORT: A Compact Object Relation Transformer for Parameter Efficient
Image Captioning
- Authors: Jia Huei Tan, Ying Hua Tan, Chee Seng Chan, Joon Huang Chuah
- Abstract summary: We present three methods for image captioning model reduction.
Our proposed ACORT models have 3.7x to 21.6x fewer parameters than the baseline model.
Results demonstrate that our ACORT models are competitive against baselines and SOTA approaches.
- Score: 13.659124860884912
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent research that applies Transformer-based architectures to image
captioning has resulted in state-of-the-art image captioning performance,
capitalising on the success of Transformers on natural language tasks.
Unfortunately, though these models work well, one major flaw is their large
model sizes. To this end, we present three parameter reduction methods for
image captioning Transformers: Radix Encoding, cross-layer parameter sharing,
and attention parameter sharing. By combining these methods, our proposed ACORT
models have 3.7x to 21.6x fewer parameters than the baseline model without
compromising test performance. Results on the MS-COCO dataset demonstrate that
our ACORT models are competitive against baselines and SOTA approaches, with
CIDEr score >=126. Finally, we present qualitative results and ablation studies
to demonstrate the efficacy of the proposed changes further. Code and
pre-trained models are publicly available at
https://github.com/jiahuei/sparse-image-captioning.
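To make the three reduction methods concrete, below is a minimal PyTorch sketch of the general ideas, not the authors' implementation (see the repository above for that). Class names and hyperparameters are invented, and the digit-sum radix variant shown is a simplification: the paper's Radix Encoding instead re-encodes each caption token as a short sequence of base-radix tokens, which is what shrinks the embedding and output layers.

```python
import torch
import torch.nn as nn

class RadixEmbedding(nn.Module):
    """Radix-style vocabulary compression (illustrative digit-sum variant).

    A vocabulary of size V normally needs a V x d embedding matrix;
    decomposing each token id into two base-`radix` digits needs only
    2 * radix rows, with radix ~ ceil(sqrt(V))."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.radix = int(vocab_size ** 0.5) + 1
        self.high_digit = nn.Embedding(self.radix, d_model)
        self.low_digit = nn.Embedding(self.radix, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Sum the two digit embeddings to form one token embedding.
        return (self.high_digit(token_ids // self.radix)
                + self.low_digit(token_ids % self.radix))

class SharedDecoder(nn.Module):
    """Cross-layer parameter sharing: one decoder layer reused N times.

    The parameter count is that of a single layer, while the forward
    pass still performs N rounds of attention like an N-layer stack."""
    def __init__(self, d_model: int, n_heads: int, n_passes: int):
        super().__init__()
        self.layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                                batch_first=True)
        self.n_passes = n_passes

    def forward(self, tgt, memory, tgt_mask=None):
        x = tgt
        for _ in range(self.n_passes):
            x = self.layer(x, memory, tgt_mask=tgt_mask)
        return x
```

Attention parameter sharing applies the same idea one level down, e.g. tying query/key projection matrices across attention modules instead of (or in addition to) sharing whole layers.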
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- TexIm FAST: Text-to-Image Representation for Semantic Similarity Evaluation using Transformers [2.7651063843287718]
TexIm FAST is a methodology for generating fixed-length text representations with a self-supervised Variational Auto-Encoder (VAE) and transformers, used for semantic evaluation.
The pictorial representations allow oblivious inference while retaining the linguistic intricacies, and are potent in cross-modal applications.
The efficacy of TexIm FAST has been extensively analyzed for the task of Semantic Textual Similarity (STS) on the MSRPC, CNN/Daily Mail, and XSum datasets.
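As a rough illustration of how a VAE yields fixed-length representations (not TexIm FAST's actual architecture, which pairs a transformer encoder with pictorial outputs), a minimal reparameterization sketch in PyTorch; all names here are hypothetical:

```python
import torch
import torch.nn as nn

class FixedLengthVAEHead(nn.Module):
    """Map pooled text features to a fixed-length latent via the usual
    VAE reparameterization; the KL term regularizes the latent space."""
    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.mu = nn.Linear(d_in, d_latent)
        self.logvar = nn.Linear(d_in, d_latent)

    def forward(self, pooled: torch.Tensor):
        mu, logvar = self.mu(pooled), self.logvar(pooled)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = 0.5 * (logvar.exp() + mu.pow(2) - 1 - logvar).sum(-1).mean()
        return z, kl  # fixed-length representation + KL regularizer
```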
arXiv Detail & Related papers (2024-06-06T18:28:50Z)
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
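For context, the generic denoising-diffusion training objective that such models optimize is sketched below; this is standard DDPM training, not DiffiT's specific contribution (which lies in the transformer architecture itself), and the `denoiser(noisy, t)` interface is an assumption:

```python
import torch
import torch.nn as nn

def ddpm_training_loss(denoiser: nn.Module, images: torch.Tensor,
                       n_steps: int = 1000) -> torch.Tensor:
    """Corrupt images with Gaussian noise at random timesteps and train
    the denoiser (e.g. a ViT) to predict that noise. Assumes the
    hypothetical interface denoiser(noisy_images, timesteps)."""
    b = images.size(0)
    t = torch.randint(0, n_steps, (b,), device=images.device)
    # Linear beta schedule -> cumulative signal rate alpha_bar_t.
    betas = torch.linspace(1e-4, 0.02, n_steps, device=images.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(images)
    noisy = alpha_bar.sqrt() * images + (1.0 - alpha_bar).sqrt() * noise
    return nn.functional.mse_loss(denoiser(noisy, t), noise)
```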
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
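A minimal sketch of the underlying visual-prompt-tuning mechanism, assuming a frozen backbone and learnable tokens prepended to the patch embeddings; E^2VPT itself goes further (e.g. prompts inside self-attention and prompt pruning), so this shows only the baseline idea:

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Freeze the backbone; learn only a few prompt tokens that are
    prepended to the patch embeddings at the input."""
    def __init__(self, backbone: nn.Module, d_model: int, n_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the prompts below are trained
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, d_model))

    def forward(self, patch_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (batch, seq_len, d_model)
        prompts = self.prompts.unsqueeze(0).expand(patch_emb.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, patch_emb], dim=1))
```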
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation [72.27914940012423]
We investigate the problem of efficient tuning for referring image segmentation.
We propose a novel adapter called Bridger to facilitate cross-modal information exchange.
We also design a lightweight decoder for image segmentation.
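Bridger's design details aside, the general adapter mechanism it builds on can be sketched as a residual bottleneck module inserted into a frozen encoder; the sketch below is the generic pattern, not Bridger's cross-modal architecture:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter: a small down-project / up-project
    module added to a frozen encoder, so only the adapter is trained.
    Bridger additionally exchanges information between the vision and
    language encoders, which this sketch omits."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection
```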
arXiv Detail & Related papers (2023-07-21T12:46:15Z)
- Learned Image Compression with Mixed Transformer-CNN Architectures [21.53261818914534]
We propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity.
Inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention.
Experimental results demonstrate our proposed method achieves state-of-the-art rate-distortion performances.
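To illustrate what an entropy model estimates, here is a schematic factorized Gaussian rate term over quantized latents; the paper's channel-wise model additionally conditions on previously decoded channel groups via swin-transformer-based attention, which is omitted here:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class FactorizedGaussianRate(nn.Module):
    """Estimate the bit cost of quantized latents under per-channel
    Gaussians: p(y_hat) is the probability mass of the quantization bin."""
    def __init__(self, channels: int):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(channels))
        self.log_scale = nn.Parameter(torch.zeros(channels))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, channels, h, w) latents; straight-through rounding
        # gradients are omitted for brevity.
        y_hat = torch.round(y)
        dist = Normal(self.mean.view(1, -1, 1, 1),
                      self.log_scale.exp().view(1, -1, 1, 1))
        p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
        return -torch.log2(p.clamp_min(1e-9)).sum()  # estimated total bits
```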
arXiv Detail & Related papers (2023-03-27T08:19:01Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
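Schematically, treating text-to-image generation as sequence-to-sequence modeling looks like machine translation over image-token ids; the sketch below assumes a tokenizer such as ViT-VQGAN has already mapped each image to discrete ids, and every name in it is hypothetical:

```python
import torch
import torch.nn as nn

class TextToImageTokens(nn.Module):
    """Seq2seq text-to-image modeling (schematic): autoregressively
    predict discrete image-token ids conditioned on the text prompt."""
    def __init__(self, text_vocab: int, image_vocab: int, d_model: int = 512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.image_emb = nn.Embedding(image_vocab, d_model)
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.to_logits = nn.Linear(d_model, image_vocab)

    def forward(self, text_ids, image_ids):
        # Teacher forcing: predict image token t from tokens < t + the text.
        tgt = self.image_emb(image_ids)
        mask = self.transformer.generate_square_subsequent_mask(
            image_ids.size(1)).to(image_ids.device)
        h = self.transformer(self.text_emb(text_ids), tgt, tgt_mask=mask)
        return self.to_logits(h)  # (batch, image_seq_len, image_vocab)
```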
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Parameter-efficient Model Adaptation for Vision Transformers [45.3460867776953]
We study parameter-efficient model adaptation strategies for vision transformers on the image classification task.
We propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions.
Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across 20 image classification datasets.
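The selection-then-tuning recipe can be sketched as follows; here the submodule list is supplied by the caller, whereas the paper derives it by measuring local intrinsic dimensions, and the example submodule name is hypothetical:

```python
import torch
import torch.nn as nn

def tune_selected_submodules(model: nn.Module, selected: list) -> None:
    """Freeze the whole model, then re-enable gradients only for the
    chosen submodules (addressed by name via nn.Module.get_submodule)."""
    for p in model.parameters():
        p.requires_grad = False
    for name in selected:
        for p in model.get_submodule(name).parameters():
            p.requires_grad = True

# Usage (hypothetical submodule name):
#   tune_selected_submodules(vit, ["encoder.layers.11.mlp"])
#   optimizer = torch.optim.AdamW(
#       (p for p in vit.parameters() if p.requires_grad), lr=1e-4)
```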
arXiv Detail & Related papers (2022-03-29T05:30:09Z)
- Generating Images with Sparse Representations [21.27273495926409]
High dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models.
We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks.
We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences.
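A minimal sketch of the JPEG-style preprocessing this implies, using SciPy's DCT; the single uniform quantization step `q` is a simplification of JPEG's frequency-dependent tables:

```python
import numpy as np
from scipy.fft import dctn

def image_to_dct_tokens(img: np.ndarray, block: int = 8, q: float = 16.0):
    """Split a grayscale image into 8x8 blocks, apply a 2-D DCT to each
    block, and coarsely quantize the coefficients, yielding the kind of
    sparse discrete sequence the autoregressive Transformer is trained on."""
    h, w = img.shape
    tokens = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            coeffs = dctn(img[y:y + block, x:x + block], norm="ortho")
            tokens.append(np.round(coeffs / q).astype(np.int64))
    return np.stack(tokens)  # (num_blocks, block, block) quantized coeffs
```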
arXiv Detail & Related papers (2021-03-05T17:56:03Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces the number of parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
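One common way to realize such parameter reduction is to share a single Transformer encoder across the audio and visual streams; the sketch below shows that generic pattern, not the paper's exact scheme:

```python
import torch.nn as nn

class SharedMultimodalEncoder(nn.Module):
    """A single Transformer encoder processes both modalities, so the
    attention/FFN weights are paid for once rather than per modality;
    only the input projections stay modality-specific."""
    def __init__(self, d_audio: int, d_visual: int, d_model: int = 512):
        super().__init__()
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_visual = nn.Linear(d_visual, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, audio_feats, visual_feats):
        a = self.encoder(self.proj_audio(audio_feats))    # shared weights
        v = self.encoder(self.proj_visual(visual_feats))  # shared weights
        return a, v
```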
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.