Dual-Level Collaborative Transformer for Image Captioning
- URL: http://arxiv.org/abs/2101.06462v1
- Date: Sat, 16 Jan 2021 15:43:17 GMT
- Title: Dual-Level Collaborative Transformer for Image Captioning
- Authors: Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue
Huang, Chia-Wen Lin, Rongrong Ji
- Abstract summary: We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
- Score: 126.59298716978577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Descriptive region features extracted by object detection networks have
played an important role in the recent advancements of image captioning.
However, they are still criticized for the lack of contextual information and
fine-grained details, which in contrast are the merits of traditional grid
features. In this paper, we introduce a novel Dual-Level Collaborative
Transformer (DLCT) network to realize the complementary advantages of the two
features. Concretely, in DLCT, these two features are first processed by a
novel Dual-way Self Attention (DWSA) to mine their intrinsic properties, where a
Comprehensive Relation Attention component is also introduced to embed the
geometric information. In addition, we propose a Locality-Constrained Cross
Attention module to address the semantic noises caused by the direct fusion of
these two features, where a geometric alignment graph is constructed to
accurately align and reinforce region and grid features. To validate our model,
we conduct extensive experiments on the highly competitive MS-COCO dataset, and
achieve new state-of-the-art performance on both local and online test sets,
i.e., 133.8% CIDEr-D on Karpathy split and 135.4% CIDEr on the official split.
Code is available at https://github.com/luo3300612/image-captioning-DLCT.
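The abstract names the modules but not their exact formulation. As a rough, hedged sketch of the locality-constrained cross-attention idea, the snippet below masks region-to-grid attention with a simple geometric alignment rule: a grid cell is treated as aligned with a region if its center falls inside the region's bounding box. The masking rule, tensor shapes, and function names are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def alignment_mask(boxes, grid_h, grid_w):
    """Mark which grid cells overlap each region box (normalized xyxy coords).

    Returns a (num_regions, grid_h * grid_w) boolean mask -- a simple stand-in
    for the geometric alignment graph described in the abstract.
    """
    ys = (torch.arange(grid_h).float() + 0.5) / grid_h    # cell-center y coords
    xs = (torch.arange(grid_w).float() + 0.5) / grid_w    # cell-center x coords
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    cx, cy = cx.reshape(-1), cy.reshape(-1)                # (grid_h * grid_w,)
    x1, y1, x2, y2 = boxes.unbind(-1)                      # each (num_regions,)
    inside_x = (cx[None, :] >= x1[:, None]) & (cx[None, :] <= x2[:, None])
    inside_y = (cy[None, :] >= y1[:, None]) & (cy[None, :] <= y2[:, None])
    return inside_x & inside_y

def locality_constrained_cross_attention(region_feats, grid_feats, mask):
    """Cross attention from region queries to grid keys/values, restricted
    to the grid cells aligned with each region."""
    d = region_feats.size(-1)
    scores = region_feats @ grid_feats.transpose(-2, -1) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    # Guard against regions aligned with no grid cell at all.
    scores = torch.where(mask.any(-1, keepdim=True), scores, torch.zeros_like(scores))
    attn = F.softmax(scores, dim=-1)
    return attn @ grid_feats

# Toy usage: 3 regions attending over a 7x7 grid of 512-d features.
regions = torch.randn(3, 512)
grids = torch.randn(7 * 7, 512)
boxes = torch.tensor([[0.0, 0.0, 0.5, 0.5],
                      [0.2, 0.2, 0.9, 0.8],
                      [0.5, 0.1, 1.0, 0.6]])
out = locality_constrained_cross_attention(regions, grids, alignment_mask(boxes, 7, 7))
print(out.shape)  # torch.Size([3, 512])
```

The Dual-way Self Attention and Comprehensive Relation Attention components are omitted here; the point is only how a geometric alignment graph can restrict cross attention to aligned region-grid pairs.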
Related papers
- Towards Local Visual Modeling for Image Captioning [87.02744388237045]
We propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF)
LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors.
LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantical complementarity.
arXiv Detail & Related papers (2023-02-13T04:42:00Z)
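The LSTNet summary above describes intra-layer attention between each grid and its neighbors without giving the formulation; the sketch below is one plausible reading, in which plain self-attention over grid features is masked to a 3x3 neighborhood. The window size and masking scheme are assumptions rather than the paper's actual LSA module.

```python
import torch
import torch.nn.functional as F

def neighborhood_mask(grid_h, grid_w, window=1):
    """Boolean (N, N) mask, N = grid_h * grid_w: cell i may attend to cell j
    only if j lies within a (2*window+1)^2 neighborhood of i."""
    rows = torch.arange(grid_h * grid_w) // grid_w
    cols = torch.arange(grid_h * grid_w) % grid_w
    near_r = (rows[:, None] - rows[None, :]).abs() <= window
    near_c = (cols[:, None] - cols[None, :]).abs() <= window
    return near_r & near_c

def locality_sensitive_attention(grid_feats, mask):
    """Self-attention over grid features restricted to local neighbors."""
    d = grid_feats.size(-1)
    scores = grid_feats @ grid_feats.transpose(-2, -1) / d ** 0.5
    attn = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return attn @ grid_feats

feats = torch.randn(7 * 7, 512)                  # 7x7 grid of 512-d features
out = locality_sensitive_attention(feats, neighborhood_mask(7, 7))
print(out.shape)                                 # torch.Size([49, 512])
```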
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, constructed from both bottom-up and top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- Siamese Attribute-missing Graph Auto-encoder [35.79233150253881]
We propose Siamese Attribute-missing Graph Auto-encoder (SAGA).
First, we entangle the attribute embedding and structure embedding by introducing a siamese network structure to share the parameters learned by both processes.
Second, we introduce a K-nearest neighbor (KNN) and structural constraint enhanced learning mechanism to improve the quality of latent features of the missing attributes.
arXiv Detail & Related papers (2021-12-09T11:21:31Z)
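The SAGA entry above mentions two mechanisms: siamese parameter sharing between the attribute and structure embeddings, and a KNN-enhanced constraint on the latent features of missing attributes. The snippet below is a loose illustration of both ideas using a toy shared encoder and illustrative losses; it does not reproduce the paper's architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One encoder reused for both views = siamese-style parameter sharing.
encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))

def knn_consistency_loss(z, k=3):
    """Pull each latent vector towards the mean of its k nearest (cosine)
    neighbours -- a toy stand-in for a KNN-enhanced latent constraint."""
    with torch.no_grad():
        zn = F.normalize(z, dim=-1)
        sim = zn @ zn.t()
        sim.fill_diagonal_(float("-inf"))       # exclude self-matches
        idx = sim.topk(k, dim=-1).indices       # (num_nodes, k) neighbour ids
    return F.mse_loss(z, z[idx].mean(dim=1))

# Toy usage: attribute features and structural features pass through the SAME
# encoder, then a KNN constraint and a cross-view agreement term are applied.
attr_x = torch.randn(100, 64)                   # per-node attribute features
struct_x = torch.randn(100, 64)                 # per-node structural features
z_attr, z_struct = encoder(attr_x), encoder(struct_x)
loss = knn_consistency_loss(z_attr) + F.mse_loss(z_attr, z_struct)
print(float(loss))
```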
- Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity levels.
arXiv Detail & Related papers (2021-07-04T09:28:18Z)
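The SAFNet entry above describes learning similarities between the 3D stream and the back-projected 2D stream and fusing them adaptively. The sketch below shows one generic way such a similarity-weighted late fusion could look, with a small scoring network gating the per-point blend; the scoring network and gating rule are assumptions, not SAFNet's actual design.

```python
import torch
import torch.nn as nn

class SimilarityAwareFusion(nn.Module):
    """Late fusion of per-point 3D features and back-projected 2D features,
    weighted by a learned similarity score (a rough reading of the idea)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat_3d, feat_2d):
        w = self.score(torch.cat([feat_3d, feat_2d], dim=-1))  # (N, 1) in [0, 1]
        return w * feat_2d + (1.0 - w) * feat_3d                # per-point blend

# Toy usage: 2048 points, 64-d features from each branch.
fuse = SimilarityAwareFusion(64)
fused = fuse(torch.randn(2048, 64), torch.randn(2048, 64))
print(fused.shape)  # torch.Size([2048, 64])
```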
- Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM)
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z)
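The DAGAN entry above names a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM) without detailing them. Below is a generic sketch of spatial and channel attention in that spirit; the specific layers, pooling choices, and the absence of the GAN itself are simplifications, not DAGAN's actual modules.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position-wise attention: reweight each spatial location of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        weights = torch.sigmoid(self.conv(x))   # (B, 1, H, W) position weights
        return x * weights

class ChannelAttention(nn.Module):
    """Channel-wise attention: reweight each channel via pooled statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        pooled = x.mean(dim=(2, 3))             # global average pool -> (B, C)
        weights = self.fc(pooled)[:, :, None, None]
        return x * weights

# Toy usage on a 64-channel feature map.
x = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(SpatialAttention(64)(x)).shape)  # torch.Size([2, 64, 32, 32])
```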
- EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection [60.097873683615695]
We aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors.
We propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner without any image annotations.
We design an end-to-end learnable framework named EPNet to integrate these two components.
arXiv Detail & Related papers (2020-07-17T09:33:05Z)
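The EPNet entry above describes enhancing point features with semantic image features in a point-wise manner. A common way to realize such image-to-point fusion is to sample the image feature map at each point's projected location and gate it into the point feature; the sketch below illustrates that pattern under assumed shapes and a made-up gating layer, not EPNet's actual fusion module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_image_features(image_feats, uv):
    """Bilinearly sample an image feature map (1, C, H, W) at normalized
    pixel coordinates uv in [0, 1], one (u, v) pair per 3D point."""
    grid = (uv * 2.0 - 1.0).view(1, -1, 1, 2)              # to [-1, 1] for grid_sample
    sampled = F.grid_sample(image_feats, grid, align_corners=False)
    return sampled.squeeze(0).squeeze(-1).t()                # (num_points, C)

class PointWiseFusion(nn.Module):
    """Gate the sampled image feature per point and add it to the point feature."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, point_feats, img_feats_per_point):
        w = self.gate(torch.cat([point_feats, img_feats_per_point], dim=-1))
        return point_feats + w * img_feats_per_point

# Toy usage: 1024 points with 64-d features, a 64-channel image feature map,
# and each point's projected image location (computed elsewhere).
image_feats = torch.randn(1, 64, 48, 160)
uv = torch.rand(1024, 2)                                     # normalized (u, v) per point
point_feats = torch.randn(1024, 64)
fused = PointWiseFusion(64)(point_feats, sample_image_features(image_feats, uv))
print(fused.shape)                                           # torch.Size([1024, 64])
```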
- aiTPR: Attribute Interaction-Tensor Product Representation for Image Caption [9.89901717499058]
Region visual features enhance the generative capability of captioning models; however, they lack proper attentional modeling of the interactions among them.
In this work, we propose the Attribute Interaction-Tensor Product Representation (aiTPR), which is a convenient way of gathering more information.
arXiv Detail & Related papers (2020-01-27T00:19:41Z)