Dual-Level Collaborative Transformer for Image Captioning
- URL: http://arxiv.org/abs/2101.06462v1
- Date: Sat, 16 Jan 2021 15:43:17 GMT
- Title: Dual-Level Collaborative Transformer for Image Captioning
- Authors: Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue
Huang, Chia-Wen Lin, Rongrong Ji
- Abstract summary: We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
- Score: 126.59298716978577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Descriptive region features extracted by object detection networks have
played an important role in the recent advancements of image captioning.
However, they are still criticized for the lack of contextual information and
fine-grained details, which in contrast are the merits of traditional grid
features. In this paper, we introduce a novel Dual-Level Collaborative
Transformer (DLCT) network to realize the complementary advantages of the two
features. Concretely, in DLCT, these two features are first processed by a
novel Dual-way Self-Attention (DWSA) to mine their intrinsic properties, where a
Comprehensive Relation Attention component is also introduced to embed the
geometric information. In addition, we propose a Locality-Constrained Cross
Attention module to address the semantic noises caused by the direct fusion of
these two features, where a geometric alignment graph is constructed to
accurately align and reinforce region and grid features. To validate our model,
we conduct extensive experiments on the highly competitive MS-COCO dataset, and
achieve new state-of-the-art performance on both local and online test sets,
i.e., 133.8% CIDEr-D on Karpathy split and 135.4% CIDEr on the official split.
Code is available at https://github.com/luo3300612/image-captioning-DLCT.
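The Locality-Constrained Cross Attention module described above restricts cross attention between the two feature levels using a geometric alignment graph. Below is a minimal NumPy sketch of that masking idea, not the authors' implementation: each region query is allowed to attend only to the grid cells its bounding box geometrically overlaps, and all other pairs are masked out before the softmax. Function and variable names, shapes, and the boolean-mask construction are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locality_constrained_cross_attention(regions, grids, align_mask):
    """Cross attention from region queries to grid keys/values,
    restricted by a geometric alignment mask.

    regions:    (R, d) region features (queries)
    grids:      (G, d) grid features (keys and values)
    align_mask: (R, G) boolean; True where a grid cell overlaps a region box
    """
    d = regions.shape[-1]
    scores = regions @ grids.T / np.sqrt(d)      # (R, G) scaled dot products
    scores = np.where(align_mask, scores, -1e9)  # block non-aligned pairs
    attn = softmax(scores, axis=-1)              # weights ~0 on masked cells
    return attn @ grids                          # (R, d) refined region feats

# toy example: 2 regions, 4 grid cells, 8-dim features
rng = np.random.default_rng(0)
regions = rng.standard_normal((2, 8))
grids = rng.standard_normal((4, 8))
mask = np.array([[True, True, False, False],
                 [False, False, True, True]])
out = locality_constrained_cross_attention(regions, grids, mask)
print(out.shape)  # (2, 8)
```

Because masked scores are pushed to a large negative value before the softmax, each output row is effectively a convex combination of only the geometrically aligned grid features, which is the mechanism the paper uses to suppress semantic noise from direct region-grid fusion.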
Related papers
- Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition [7.632962062462334]
Zero-shot Handwritten Chinese Character Recognition aims to recognize unseen characters by leveraging radical-based semantic compositions. We propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. Our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset.
arXiv Detail & Related papers (2026-02-03T16:08:40Z) - Dual-Stream Collaborative Transformer for Image Captioning [25.901654895839613]
We propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.
arXiv Detail & Related papers (2026-01-19T10:28:56Z) - Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion [73.11061598576798]
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving. CIGOcc is a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism.
arXiv Detail & Related papers (2025-10-15T06:37:33Z) - OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly. CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation. This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z) - Towards Local Visual Modeling for Image Captioning [87.02744388237045]
We propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF).
LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors.
LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantical complementarity.
arXiv Detail & Related papers (2023-02-13T04:42:00Z) - Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, which is constructed by both the bottom-up and the top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
While in HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z) - Siamese Attribute-missing Graph Auto-encoder [35.79233150253881]
We propose a Siamese Attribute-missing Graph Auto-encoder (SAGA).
First, we entangle the attribute embedding and structure embedding by introducing a siamese network structure to share the parameters learned by both processes.
Second, we introduce a K-nearest neighbor (KNN) and structural constraint enhanced learning mechanism to improve the quality of latent features of the missing attributes.
arXiv Detail & Related papers (2021-12-09T11:21:31Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity.
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z) - EPNet: Enhancing Point Features with Image Semantics for 3D Object
Detection [60.097873683615695]
We aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors.
We propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner without any image annotations.
We design an end-to-end learnable framework named EPNet to integrate these two components.
arXiv Detail & Related papers (2020-07-17T09:33:05Z) - aiTPR: Attribute Interaction-Tensor Product Representation for Image
Caption [9.89901717499058]
Region visual features enhance the generative capability of feature-based captioning models; however, they lack proper attentional interaction among one another.
In this work, we propose Attribute Interaction-Tensor Product Representation (aiTPR) which is a convenient way of gathering more information.
arXiv Detail & Related papers (2020-01-27T00:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.