Image Captioning via Dynamic Path Customization
- URL: http://arxiv.org/abs/2406.00334v1
- Date: Sat, 1 Jun 2024 07:23:21 GMT
- Title: Image Captioning via Dynamic Path Customization
- Authors: Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Xiaopeng Hong, Yongjian Wu, Rongrong Ji
- Abstract summary: We propose a novel Dynamic Transformer Network (DTNet) for image captioning, which dynamically assigns customized paths to different samples, leading to discriminative yet accurate captions.
To validate the effectiveness of our proposed DTNet, we conduct extensive experiments on the MS-COCO dataset and achieve new state-of-the-art performance.
- Score: 100.15412641586525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores a novel dynamic network for vision and language tasks, where the inferring structure is customized on the fly for different inputs. Most previous state-of-the-art approaches are static and hand-crafted networks, which not only heavily rely on expert knowledge, but also ignore the semantic diversity of input samples, therefore resulting in suboptimal performance. To address these issues, we propose a novel Dynamic Transformer Network (DTNet) for image captioning, which dynamically assigns customized paths to different samples, leading to discriminative yet accurate captions. Specifically, to build a rich routing space and improve routing efficiency, we introduce five types of basic cells and group them into two separate routing spaces according to their operating domains, i.e., spatial and channel. Then, we design a Spatial-Channel Joint Router (SCJR), which endows the model with the capability of path customization based on both spatial and channel information of the input sample. To validate the effectiveness of our proposed DTNet, we conduct extensive experiments on the MS-COCO dataset and achieve new state-of-the-art performance on both the Karpathy split and the online test server.
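To make the routing idea concrete, below is a minimal PyTorch sketch in the spirit of SCJR; the module names, pooling choices, and soft-weighting scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialChannelJointRouter(nn.Module):
    """Per-sample soft routing weights from spatial and channel statistics
    (a sketch of the SCJR idea; the real design may differ)."""
    def __init__(self, dim: int, num_cells: int):
        super().__init__()
        self.spatial_score = nn.Linear(dim, 1)   # scores each grid position
        self.fc = nn.Linear(2 * dim, num_cells)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) grid features of one image
        chan = x.mean(dim=1)                                # channel statistics
        attn = torch.softmax(self.spatial_score(x), dim=1)  # (batch, tokens, 1)
        spat = (attn * x).sum(dim=1)                        # spatially weighted summary
        logits = self.fc(torch.cat([chan, spat], dim=-1))
        return torch.softmax(logits, dim=-1)                # (batch, num_cells)

class RoutedLayer(nn.Module):
    """Mixes candidate cells with the router's per-sample weights."""
    def __init__(self, dim: int, cells):
        super().__init__()
        self.cells = nn.ModuleList(cells)
        self.router = SpatialChannelJointRouter(dim, len(cells))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.router(x)                                  # (batch, num_cells)
        outs = torch.stack([cell(x) for cell in self.cells], dim=1)
        return (w[:, :, None, None] * outs).sum(dim=1)      # per-sample path mix
```

Each routed layer mixes its candidate cells with the router's per-sample weights, so different inputs effectively traverse different paths; a hard (discrete) selection rule would additionally allow skipping unused cells at inference.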
Related papers
- Self-supervised Pre-training for Semantic Segmentation in an Indoor Scene [8.357801312689622]
We propose RegConsist, a method for self-supervised pre-training of a semantic segmentation model.
We use a variant of contrastive learning to train a DCNN model for predicting semantic segmentation from RGB views in the target environment.
The proposed method outperforms models pre-trained on ImageNet and achieves competitive performance when using models that are trained for exactly the same task but on a different dataset.
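As a rough illustration of the cross-view idea, the sketch below pairs pixel embeddings from two registered views with an InfoNCE-style loss; the actual pairing rule and loss in RegConsist may differ.

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(z_a: torch.Tensor, z_b: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: (num_pixels, dim) embeddings of corresponding pixels
    from two registered RGB views of the same scene."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)            # diagonal pairs are positives
```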
arXiv Detail & Related papers (2022-10-04T20:10:14Z)
- DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection [34.42038300372715]
We propose DPTNet, a simple yet effective architecture to model the global and local information for the scene text detection task.
We propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism to provide complementary clues between the attention path and convolutional path.
Our DPTNet achieves state-of-the-art results on the MSRA-TD500 dataset, and provides competitive results on other standard benchmarks in terms of both detection accuracy and speed.
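A minimal sketch of such a parallel design, assuming a depthwise-convolution local path fused with a self-attention global path by addition (the paper's exact fusion may differ):

```python
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Parallel convolutional (local) and self-attention (global) paths."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.conv_path = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise: local cues
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise: mix channels
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, dim, H, W)
        b, c, h, w = x.shape
        local = self.conv_path(x)
        tokens = x.flatten(2).transpose(1, 2)          # (batch, H*W, dim)
        glob, _ = self.attn(tokens, tokens, tokens)    # global self-attention
        glob = self.norm(glob).transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob                        # complementary fusion
```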
arXiv Detail & Related papers (2022-08-21T12:58:45Z)
- Dual-Level Collaborative Transformer for Image Captioning [126.59298716978577]
We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of region and grid visual features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noise caused by the direct fusion of these two features.
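One plausible way to constrain cross attention to a locality is sketched below with a simple distance-based window mask; the window rule here is an assumption, not DLCT's exact alignment scheme.

```python
import torch
import torch.nn as nn

def locality_mask(num_q: int, num_k: int, window: int) -> torch.Tensor:
    """True where attention is forbidden (outside the aligned local window)."""
    q_pos = torch.linspace(0, num_k - 1, num_q)[:, None]  # align query i to the key axis
    k_pos = torch.arange(num_k)[None, :]
    return (q_pos - k_pos).abs() > window                 # (num_q, num_k) bool mask

class LocalCrossAttention(nn.Module):
    """Cross attention where each query only sees keys in a local neighborhood."""
    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, queries, keys):
        # queries: (batch, Lq, dim); keys: (batch, Lk, dim)
        mask = locality_mask(queries.size(1), keys.size(1), self.window).to(queries.device)
        out, _ = self.attn(queries, keys, keys, attn_mask=mask)
        return out
```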
arXiv Detail & Related papers (2021-01-16T15:43:17Z)
- Learning to Generate Content-Aware Dynamic Detectors [62.74209921174237]
We introduce a new perspective on designing efficient detectors: automatically generating sample-adaptive model architectures.
We introduce a coarse-to-fine strategy tailored for object detection to guide the learning of dynamic routing.
Experiments on the MS-COCO dataset demonstrate that CADDet achieves 1.8 points higher mAP with 10% fewer FLOPs compared with vanilla routing.
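A hedged sketch of sample-adaptive block selection using Gumbel-softmax routing (an illustrative assumption; the paper's router may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveBlock(nn.Module):
    """A light router picks one candidate block per sample, end to end."""
    def __init__(self, dim: int, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        self.router = nn.Linear(dim, len(candidates))

    def forward(self, x):
        # x: (batch, dim, H, W); route from globally pooled features
        logits = self.router(x.mean(dim=(2, 3)))                    # (batch, K)
        w = F.gumbel_softmax(logits, tau=1.0, hard=True)            # one-hot, differentiable
        outs = torch.stack([m(x) for m in self.candidates], dim=1)  # (batch, K, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```

With `hard=True` the forward pass picks a single block per sample, so at inference only the selected branch needs to execute, which is where the FLOP savings come from.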
arXiv Detail & Related papers (2020-12-08T08:05:20Z)
- CoDiNet: Path Distribution Modeling with Consistency and Diversity for Dynamic Routing [19.296118763012146]
We see dynamic routing networks in a fresh light, formulating a routing method as a mapping from a sample space to a routing space.
We propose a novel method, termed CoDiNet, to model the relationship between a sample space and a routing space.
Specifically, samples with similar semantics should be mapped into the same area in routing space, while those with dissimilar semantics should be mapped into different areas.
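The two objectives can be sketched as a consistency term between routing vectors of two views of the same samples and a hinge-style diversity term across different samples; the exact formulation below is an assumption.

```python
import torch
import torch.nn.functional as F

def consistency_loss(routes_a: torch.Tensor, routes_b: torch.Tensor) -> torch.Tensor:
    """routes_a/b: (batch, path_dim) routing vectors of two augmented views
    of the same samples; similar semantics -> same area of routing space."""
    return F.mse_loss(routes_a, routes_b)

def diversity_loss(routes: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Push routing vectors of distinct samples apart, up to a margin."""
    dists = torch.cdist(routes, routes)                    # (batch, batch) pairwise distances
    off_diag = ~torch.eye(len(routes), dtype=torch.bool, device=routes.device)
    return F.relu(margin - dists[off_diag]).mean()         # hinge on off-diagonal pairs
```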
arXiv Detail & Related papers (2020-05-29T08:09:21Z)
- Learning Dynamic Routing for Semantic Segmentation [86.56049245100084]
This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing.
The proposed framework generates data-dependent routes, adapting to the scale distribution of each image.
To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly.
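A minimal sketch of such a gate, using relu(tanh(.)) so the gate can reach exactly zero and the path can be pruned (one plausible form, not necessarily the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftConditionalGate(nn.Module):
    """Data-dependent scalar gate in [0, 1) that can hit exactly zero,
    letting a scale-transform path be skipped at inference."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, H, W) -> per-sample gate (batch, 1, 1, 1)
        return F.relu(torch.tanh(self.fc(self.pool(x))))
```

Multiplying a path's output by this gate keeps the network differentiable end to end while still letting individual samples drop paths whose gates collapse to zero.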
arXiv Detail & Related papers (2020-03-23T17:22:14Z)
- Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation [148.9985519929653]
We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation.
The proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis.
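As a simplified sketch, attention selection can be written as learned spatial maps that softly pick among several candidate generations per pixel; the shapes and names here are assumptions.

```python
import torch
import torch.nn as nn

class AttentionSelection(nn.Module):
    """Learned spatial attention maps select among K candidate outputs."""
    def __init__(self, feat_channels: int, num_candidates: int):
        super().__init__()
        self.attn = nn.Conv2d(feat_channels, num_candidates, kernel_size=1)

    def forward(self, feats, candidates):
        # feats: (batch, feat_channels, H, W)
        # candidates: (batch, K, 3, H, W) intermediate generations
        maps = torch.softmax(self.attn(feats), dim=1)       # (batch, K, H, W)
        return (maps.unsqueeze(2) * candidates).sum(dim=1)  # (batch, 3, H, W)
```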
arXiv Detail & Related papers (2020-02-03T23:17:10Z)
- Gated Path Selection Network for Semantic Segmentation [72.44994579325822]
We develop a novel network named Gated Path Selection Network (GPSNet), which aims to learn adaptive receptive fields.
In GPSNet, we first design a two-dimensional multi-scale network - SuperNet, which densely incorporates features from growing receptive fields.
To dynamically select desirable semantic context, a gate prediction module is further introduced.
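A hedged sketch of gated selection over parallel receptive-field branches, with dilated convolutions and globally pooled gates as assumptions:

```python
import torch
import torch.nn as nn

class GatedMultiScale(nn.Module):
    """Parallel branches with growing receptive fields, fused by predicted gates."""
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(dilations), kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, H, W)
        gates = torch.softmax(self.gate(x), dim=1)                 # (batch, K, 1, 1)
        outs = torch.stack([b(x) for b in self.branches], dim=1)   # (batch, K, C, H, W)
        return (gates.unsqueeze(2) * outs).sum(dim=1)              # per-sample gated fusion
```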
arXiv Detail & Related papers (2020-01-19T12:32:17Z)