Improving Image Captioning by Leveraging Intra- and Inter-layer Global
Representation in Transformer Network
- URL: http://arxiv.org/abs/2012.07061v1
- Date: Sun, 13 Dec 2020 13:38:58 GMT
- Title: Improving Image Captioning by Leveraging Intra- and Inter-layer Global
Representation in Transformer Network
- Authors: Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian
Wu, Yue Gao, Rongrong Ji
- Abstract summary: We introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation.
GET adaptively guides the decoder to generate high-quality captions.
- Score: 96.4761273757796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended over to produce the vectorial representations that guide caption decoding. However, such vectorial representations contain only region-level information and ignore the global information reflecting the entire image, which limits the capacity for complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation, which then adaptively guides the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for the embedding of the global feature, and a Global Adaptive Decoder is designed for the guidance of caption generation. The former models intra- and inter-layer global representation by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide caption generation. Extensive experiments on the MS COCO dataset demonstrate the superiority of our GET over many state-of-the-art methods.
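As a rough illustration of the decoder-side fusion described above, the PyTorch sketch below implements a gated controller that injects a pooled global image vector into the decoder states. The sigmoid-gate formulation, layer sizes, and class interface are assumptions made for illustration, not the paper's exact Global Adaptive Controller design.

```python
import torch
import torch.nn as nn

class GlobalAdaptiveController(nn.Module):
    """Gated fusion of a global image vector into decoder states.

    The gating form (a sigmoid gate over a linear projection) is an
    illustrative assumption; the paper defines its own controller.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, dec_states: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # dec_states: (batch, seq_len, d_model); global_feat: (batch, d_model)
        g = global_feat.unsqueeze(1).expand_as(dec_states)  # broadcast the global vector
        # Per-position gate decides how much global information to let in.
        alpha = torch.sigmoid(self.gate(torch.cat([dec_states, g], dim=-1)))
        return dec_states + alpha * self.proj(g)  # adaptively inject global info

# Toy usage: fuse a pooled global feature into 20 decoder positions.
ctrl = GlobalAdaptiveController(d_model=512)
out = ctrl(torch.randn(2, 20, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 20, 512])
```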
Related papers
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets that offer high resolution and contextual diversity.
Expanding images based on text requires global coherence and rich local context understanding.
We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
- Recursive Generalization Transformer for Image Super-Resolution [108.67898547357127]
We propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images.
We combine the recursive-generalization self-attention (RG-SA) with local self-attention to enhance the exploitation of the global context.
Our RGT outperforms recent state-of-the-art methods quantitatively and qualitatively.
arXiv Detail & Related papers (2023-03-11T10:44:44Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification tasks, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches.
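A minimal PyTorch sketch of this context-token idea, in the spirit of a ViT [CLS] token: a learnable token is prepended to the patch tokens and mixed with them through self-attention. The layer sizes and class names are illustrative assumptions, not the crowd-counting model's actual configuration.

```python
import torch
import torch.nn as nn

class ContextTokenEncoder(nn.Module):
    """Prepend a learnable context token to patch tokens, ViT-style.

    A generic sketch of the idea; hidden sizes and depth are assumptions.
    """

    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.context_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, d_model)
        ctx = self.context_token.expand(patch_tokens.size(0), -1, -1)
        # Self-attention lets the context token exchange information with patches.
        x = self.encoder(torch.cat([ctx, patch_tokens], dim=1))
        return x[:, 0]  # the context token now summarizes global image information

tokens = torch.randn(2, 196, 256)  # e.g. 14x14 image patches
print(ContextTokenEncoder()(tokens).shape)  # torch.Size([2, 256])
```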
arXiv Detail & Related papers (2021-05-23T12:44:27Z)
- Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z)
- A U-Net Based Discriminator for Generative Adversarial Networks [86.67102929147592]
We propose an alternative U-Net based discriminator architecture for generative adversarial networks (GANs).
The proposed architecture provides detailed per-pixel feedback to the generator while maintaining the global coherence of synthesized images.
The novel discriminator improves over the state of the art in terms of the standard distribution and image quality metrics.
arXiv Detail & Related papers (2020-02-28T11:16:54Z)
- GRET: Global Representation Enhanced Transformer [85.58930151690336]
Transformer, based on the encoder-decoder framework, has achieved state-of-the-art performance on several natural language generation tasks.
We propose a novel global representation enhanced Transformer (GRET) to explicitly model global representation in the Transformer network.
arXiv Detail & Related papers (2020-02-24T07:37:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.