Image Captioning In the Transformer Age
- URL: http://arxiv.org/abs/2204.07374v1
- Date: Fri, 15 Apr 2022 08:13:39 GMT
- Title: Image Captioning In the Transformer Age
- Authors: Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
- Abstract summary: Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture.
This paper analyzes the connections between IC and some popular self-supervised learning paradigms.
- Score: 71.06437715212911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image Captioning (IC) has achieved astonishing developments by incorporating
various techniques into the CNN-RNN encoder-decoder architecture. However,
since CNNs and RNNs do not share the same basic network component, such a
heterogeneous pipeline is hard to train end-to-end, so the visual
encoder learns nothing from the caption supervision. This drawback
inspires researchers to develop a homogeneous architecture that facilitates
end-to-end training. The Transformer is an ideal choice for this: it has proven
its huge potential in both the vision and language domains and thus can be used as
the basic component of both the visual encoder and the language decoder in an IC
pipeline. Meanwhile, self-supervised learning unlocks the power of the
Transformer architecture, so that a pre-trained large-scale model can be generalized
to various tasks, including IC. The success of these large-scale models seems to
weaken the importance of the single IC task. However, we demonstrate that IC
still has its specific significance in this age by analyzing the connections
between IC and some popular self-supervised learning paradigms. Due to the
page limitation, we refer only to highly important papers in this short survey;
more related works can be found at
https://github.com/SjokerLily/awesome-image-captioning.
Related papers
- STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation [1.9526521731584066]
The Super Token Attention (STA) mechanism adapts the concept of superpixels from pixel space to token space, using super tokens as compact visual representations.
In this work, we introduce the STA module into the UNet architecture (STA-UNet) to limit redundancy without losing rich information.
Experimental results on four publicly available datasets demonstrate the superiority of STA-UNet over existing state-of-the-art architectures.
arXiv Detail & Related papers (2024-10-13T07:19:46Z)
- Dilated-UNet: A Fast and Accurate Medical Image Segmentation Approach using a Dilated Transformer and U-Net Architecture [0.6445605125467572]
This paper introduces Dilated-UNet, which combines a Dilated Transformer block with the U-Net architecture for accurate and fast medical image segmentation.
The results of our experiments show that Dilated-UNet outperforms other models on several challenging medical image segmentation datasets.
arXiv Detail & Related papers (2023-04-22T17:20:13Z)
- LAVT: Language-Aware Vision Transformer for Referring Image Segmentation [80.54244087314025]
We show that better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in a vision Transformer encoder network.
Our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
arXiv Detail & Related papers (2021-12-04T04:53:35Z)
- IICNet: A Generic Framework for Reversible Image Conversion [40.21904131503064]
Reversible image conversion (RIC) aims to build a reversible transformation between specific visual content (e.g., short videos) and an embedding image.
This work develops Invertible Image Conversion Net (IICNet) as a generic solution to various RIC tasks due to its strong capacity and task-independent design.
arXiv Detail & Related papers (2021-09-09T13:06:59Z)
- UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation [6.646135062704341]
The Transformer architecture has proven successful in a number of natural language processing tasks.
We present UTNet, a powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation.
arXiv Detail & Related papers (2021-07-02T00:56:27Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is a Unet-like pure Transformer for medical image segmentation.
The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.