Dual-Level Collaborative Transformer for Image Captioning
- URL: http://arxiv.org/abs/2101.06462v1
- Date: Sat, 16 Jan 2021 15:43:17 GMT
- Title: Dual-Level Collaborative Transformer for Image Captioning
- Authors: Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue
Huang, Chia-Wen Lin, Rongrong Ji
- Abstract summary: We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
- Score: 126.59298716978577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Descriptive region features extracted by object detection networks have
played an important role in the recent advancements of image captioning.
However, they are still criticized for the lack of contextual information and
fine-grained details, which in contrast are the merits of traditional grid
features. In this paper, we introduce a novel Dual-Level Collaborative
Transformer (DLCT) network to realize the complementary advantages of the two
features. Concretely, in DLCT, these two features are first processed by a
novel Dual-way Self-Attention (DWSA) to mine their intrinsic properties, where a
Comprehensive Relation Attention component is also introduced to embed the
geometric information. In addition, we propose a Locality-Constrained Cross
Attention module to address the semantic noises caused by the direct fusion of
these two features, where a geometric alignment graph is constructed to
accurately align and reinforce region and grid features. To validate our model,
we conduct extensive experiments on the highly competitive MS-COCO dataset, and
achieve new state-of-the-art performance on both local and online test sets,
i.e., 133.8% CIDEr-D on Karpathy split and 135.4% CIDEr on the official split.
Code is available at https://github.com/luo3300612/image-captioning-DLCT.
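The Locality-Constrained Cross Attention module described above restricts cross attention between the two feature levels using a geometric alignment graph. Below is a minimal NumPy sketch of that masking idea, not the authors' implementation: each region query is allowed to attend only to the grid cells its bounding box geometrically overlaps, and all other pairs are masked out before the softmax. Function and variable names, shapes, and the boolean-mask construction are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locality_constrained_cross_attention(regions, grids, align_mask):
    """Cross attention from region queries to grid keys/values,
    restricted by a geometric alignment mask.

    regions:    (R, d) region features (queries)
    grids:      (G, d) grid features (keys and values)
    align_mask: (R, G) boolean; True where a grid cell overlaps a region box
    """
    d = regions.shape[-1]
    scores = regions @ grids.T / np.sqrt(d)      # (R, G) scaled dot products
    scores = np.where(align_mask, scores, -1e9)  # block non-aligned pairs
    attn = softmax(scores, axis=-1)              # weights ~0 on masked cells
    return attn @ grids                          # (R, d) refined region feats

# toy example: 2 regions, 4 grid cells, 8-dim features
rng = np.random.default_rng(0)
regions = rng.standard_normal((2, 8))
grids = rng.standard_normal((4, 8))
mask = np.array([[True, True, False, False],
                 [False, False, True, True]])
out = locality_constrained_cross_attention(regions, grids, mask)
print(out.shape)  # (2, 8)
```

Because masked scores are pushed to a large negative value before the softmax, each output row is effectively a convex combination of only the geometrically aligned grid features, which is the mechanism the paper uses to suppress semantic noise from direct region-grid fusion.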
Related papers
- Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition [7.632962062462334]
Zero-shot Handwritten Chinese Character Recognition aims to recognize unseen characters by leveraging radical-based semantic compositions. We propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. Our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset.
arXiv Detail & Related papers (2026-02-03T16:08:40Z) - Dual-Stream Collaborative Transformer for Image Captioning [25.901654895839613]
We propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.
arXiv Detail & Related papers (2026-01-19T10:28:56Z) - Complementary Information Guided Occupancy Prediction via Multi-Level Representation Fusion [73.11061598576798]
Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving. CIGOcc is a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism.
arXiv Detail & Related papers (2025-10-15T06:37:33Z) - OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly. CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation. This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z) - Towards Local Visual Modeling for Image Captioning [87.02744388237045]
We propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF).
LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors.
LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantical complementarity.
arXiv Detail & Related papers (2023-02-13T04:42:00Z) - Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called hybrid routing transformer (HRT).
In the HRT encoder, we embed an active attention, which is constructed by both the bottom-up and the top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
While in HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z) - Siamese Attribute-missing Graph Auto-encoder [35.79233150253881]
We propose a Siamese Attribute-missing Graph Auto-encoder (SAGA).
First, we entangle the attribute embedding and structure embedding by introducing a siamese network structure to share the parameters learned by both processes.
Second, we introduce a K-nearest neighbor (KNN) and structural constraint enhanced learning mechanism to improve the quality of latent features of the missing attributes.
arXiv Detail & Related papers (2021-12-09T11:21:31Z) - Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity.
arXiv Detail & Related papers (2021-07-04T09:28:18Z) - Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z) - EPNet: Enhancing Point Features with Image Semantics for 3D Object
Detection [60.097873683615695]
We aim at addressing two critical issues in the 3D detection task, including the exploitation of multiple sensors.
We propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner without any image annotations.
We design an end-to-end learnable framework named EPNet to integrate these two components.
arXiv Detail & Related papers (2020-07-17T09:33:05Z) - aiTPR: Attribute Interaction-Tensor Product Representation for Image
Caption [9.89901717499058]
Region visual features enhance the generative capability of feature-based captioning models; however, they lack proper attentional interaction among one another.
In this work, we propose Attribute Interaction-Tensor Product Representation (aiTPR) which is a convenient way of gathering more information.
arXiv Detail & Related papers (2020-01-27T00:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.