End-to-End Transformer Based Model for Image Captioning
- URL: http://arxiv.org/abs/2203.15350v1
- Date: Tue, 29 Mar 2022 08:47:46 GMT
- Title: End-to-End Transformer Based Model for Image Captioning
- Authors: Yiyu Wang, Jungang Xu, Yingfei Sun
- Abstract summary: Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art CIDEr scores of 138.2% (single model) and 141.0% (ensemble of 4 models).
- Score: 1.4303104706989949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CNN-LSTM based architectures have played an important role in image
captioning, but limited by their training efficiency and expression ability,
researchers began to explore CNN-Transformer based models and achieved great
success. Meanwhile, almost all recent works adopt Faster R-CNN as the backbone
encoder to extract region-level features from given images. However, Faster
R-CNN requires pre-training on an additional dataset, which divides the image
captioning task into two stages and limits its potential applications. In this
paper, we build a pure Transformer-based model, which integrates image
captioning into one stage and realizes end-to-end training. First, we adopt
Swin Transformer to replace Faster R-CNN as the backbone encoder and extract
grid-level features from given images; then, following the Transformer
architecture, we build a refining encoder and a decoder. The refining encoder
refines the grid features by capturing the intra-relationships between them,
and the decoder decodes the refined features into captions word by word.
Furthermore, to increase the interaction between multi-modal (vision and
language) features and enhance the modeling capability, we compute the mean
pooling of the grid features as a global feature, introduce it into the
refining encoder to be refined together with the grid features, and add a
pre-fusion step between the refined global feature and the generated words in
the decoder. To validate the effectiveness of the proposed model, we conduct
experiments on the MSCOCO dataset. Compared to existing published works, the
experimental results demonstrate that our model achieves new state-of-the-art
CIDEr scores of 138.2% (single model) and 141.0% (ensemble of 4 models) on the
'Karpathy' offline test split, and 136.0% (c5) and 138.3% (c40) on the official
online test server. Trained models and source code will be released.
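
The pipeline described in the abstract (a Swin backbone producing grid features, a refining encoder that refines the mean-pooled global feature together with the grid features, and a decoder that pre-fuses the refined global feature with the word embeddings) can be summarized in a short sketch. The code below is a minimal PyTorch approximation, not the authors' released implementation: the Swin backbone is stubbed by precomputed grid features, standard nn.Transformer layers stand in for the paper's encoder/decoder blocks, and all module names and dimensions are illustrative assumptions.

```python
# Minimal sketch of the described architecture; names/dimensions are assumptions.
import torch
import torch.nn as nn

class RefiningCaptioner(nn.Module):
    def __init__(self, vocab_size, grid_dim=1024, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.proj = nn.Linear(grid_dim, d_model)   # map backbone channels to d_model
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.refiner = nn.TransformerEncoder(enc_layer, n_layers)   # refining encoder
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)   # caption decoder
        self.embed = nn.Embedding(vocab_size, d_model)              # word embeddings
        self.fuse = nn.Linear(2 * d_model, d_model)                 # pre-fusion layer
        self.head = nn.Linear(d_model, vocab_size)                  # vocabulary logits

    def forward(self, grid_feats, captions):
        # grid_feats: (B, N, grid_dim) grid-level features from the (stubbed) backbone
        # captions:   (B, T) token ids of the shifted ground-truth caption
        g = self.proj(grid_feats)
        global_feat = g.mean(dim=1, keepdim=True)      # mean pooling -> global feature
        # Refine the global feature jointly with the grid features.
        refined = self.refiner(torch.cat([global_feat, g], dim=1))
        refined_global, refined_grid = refined[:, :1], refined[:, 1:]
        # Pre-fusion: combine the refined global feature with every word embedding.
        words = self.embed(captions)
        fused = self.fuse(torch.cat(
            [words, refined_global.expand(-1, words.size(1), -1)], dim=-1))
        T = captions.size(1)
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)  # causal mask
        out = self.decoder(fused, refined_grid, tgt_mask=mask)
        return self.head(out)                          # (B, T, vocab_size)

# Toy usage: 49 grid positions, a 10k-token vocabulary, captions of length 12.
model = RefiningCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 49, 1024), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In the sketch the grid features are passed in directly so the example stays self-contained; in the paper the Swin Transformer is trained jointly with the refining encoder and decoder, which is what makes the whole pipeline end-to-end.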
Related papers
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction
Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z) - An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder-decoder framework for image captioning.
It takes the semantic context into consideration as well as the time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z) - ConvTransSeg: A Multi-resolution Convolution-Transformer Network for
Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg).
It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction.
Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z) - An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition
using a Novel Transformers-based Model and an Innovative 270 Million-Words
Multi-Font Corpus of Classical Arabic with Diacritics [0.0]
This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) of Arabic historical documents.
We propose an end-to-end text recognition approach using a Vision Transformer, namely BEIT, as the encoder and a vanilla Transformer as the decoder, eliminating CNNs for feature extraction and reducing the model's complexity.
arXiv Detail & Related papers (2022-08-20T22:21:19Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z) - Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - Improved Bengali Image Captioning via deep convolutional neural network
based encoder-decoder model [0.8793721044482612]
This paper presents an end-to-end image captioning system utilizing a multimodal architecture.
Our approach's language encoder captures the fine-grained information in the caption, and combined with the image features, it generates accurate and diversified captions.
arXiv Detail & Related papers (2021-02-14T16:44:17Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective
with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)