Towards Local Visual Modeling for Image Captioning
- URL: http://arxiv.org/abs/2302.06098v1
- Date: Mon, 13 Feb 2023 04:42:00 GMT
- Title: Towards Local Visual Modeling for Image Captioning
- Authors: Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji
- Abstract summary: We propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF).
LSA handles the intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors.
LSF performs inter-layer information fusion, aggregating the information of different encoder layers for cross-layer semantic complementarity.
- Score: 87.02744388237045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the local visual modeling with grid features for
image captioning, which is critical for generating accurate and detailed
captions. To achieve this target, we propose a Locality-Sensitive Transformer
Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention
(LSA) and Locality-Sensitive Fusion (LSF). LSA handles the intra-layer
interaction in the Transformer by modeling the relationship between each grid
and its neighbors, which reduces the difficulty of local object recognition
during captioning. LSF is used for inter-layer information fusion, aggregating
the information of different encoder layers for cross-layer semantic
complementarity. With these two novel designs, the proposed LSTNet can model
the local visual information of grid features to improve captioning quality.
To validate LSTNet, we conduct extensive experiments on the competitive
MS-COCO benchmark. The experimental results show that LSTNet is not only
capable of local visual modeling, but also outperforms a number of
state-of-the-art captioning models in offline and online testing, i.e., 134.8
CIDEr and 136.3 CIDEr, respectively. In addition, the generalization of LSTNet
is also verified on the Flickr8k and Flickr30k datasets.
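A minimal, illustrative sketch of how the two designs could be realized, assuming LSA is self-attention masked to a k x k spatial neighborhood over the H x W grid features and LSF is a learned softmax-weighted aggregation of the per-layer encoder outputs; the exact formulations in the LSTNet paper may differ, and all module and argument names below are hypothetical.

```python
# Hedged sketch only: not the authors' implementation. LSA is approximated by
# neighborhood-masked multi-head attention; LSF by a learned weighted sum of
# encoder layer outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def local_attention_mask(h, w, k=3):
    """Boolean mask (h*w, h*w): True where grid j lies in a k x k window around grid i."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)      # (h*w, 2)
    diff = (coords[:, None, :] - coords[None, :, :]).abs()         # (h*w, h*w, 2)
    return (diff <= k // 2).all(dim=-1)


class LocalitySensitiveAttention(nn.Module):
    """Assumed LSA: self-attention restricted to each grid's spatial neighbors."""

    def __init__(self, dim, heads=8, k=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.k = k

    def forward(self, x, h, w):
        # x: (B, h*w, dim) flattened grid features
        mask = local_attention_mask(h, w, self.k).to(x.device)
        out, _ = self.attn(x, x, x, attn_mask=~mask)  # block non-neighbor positions
        return out


class LocalitySensitiveFusion(nn.Module):
    """Assumed LSF: aggregate all encoder layer outputs with learned weights."""

    def __init__(self, num_layers, dim):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_outputs):
        # layer_outputs: list of (B, h*w, dim), one tensor per encoder layer
        stacked = torch.stack(layer_outputs, dim=0)              # (L, B, h*w, dim)
        weights = F.softmax(self.layer_weights, dim=0)           # (L,)
        fused = (weights[:, None, None, None] * stacked).sum(0)  # (B, h*w, dim)
        return self.proj(fused)


if __name__ == "__main__":
    B, H, W, D = 2, 7, 7, 512
    grids = torch.randn(B, H * W, D)
    lsa = LocalitySensitiveAttention(D, heads=8, k=3)
    lsf = LocalitySensitiveFusion(num_layers=3, dim=D)
    layer_outs = [lsa(grids, H, W) for _ in range(3)]  # stand-in for 3 encoder layers
    print(lsf(layer_outs).shape)  # torch.Size([2, 49, 512])
```

Restricting the attention mask to a k x k window is what confines each grid to its spatial neighbors, mirroring the local visual modeling described in the abstract; the layer-weighted fusion step illustrates how information from different encoder depths could be combined.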
Related papers
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- LadleNet: A Two-Stage UNet for Infrared Image to Visible Image Translation Guided by Semantic Segmentation [5.125530969984795]
We propose an improved algorithm for image translation based on U-net called LadleNet.
LadleNet+ replaces the Handle module in LadleNet with a pre-trained DeepLabv3+ network, enabling the model to have a more powerful capability in constructing semantic space.
Compared to existing methods, LadleNet and LadleNet+ achieved an average improvement of 12.4% and 15.2% in SSIM metrics, and 37.9% and 50.6% in MS-SSIM metrics, respectively.
arXiv Detail & Related papers (2023-08-12T16:14:44Z)
- LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers [60.51925353387151]
We propose a novel module named Local Context Propagation (LCP) to exploit the message passing between neighboring local regions.
We use the overlap points of adjacent local regions as intermediaries, then re-weight the features of these shared points from different local regions before passing them to the next layers.
The proposed method is applicable to different tasks and outperforms various transformer-based methods in benchmarks including 3D shape classification and dense prediction tasks.
arXiv Detail & Related papers (2022-10-23T15:43:01Z)
- Dual-Level Collaborative Transformer for Image Captioning [126.59298716978577]
We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
arXiv Detail & Related papers (2021-01-16T15:43:17Z)
- Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling [5.362051433497476]
We propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism.
We capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images.
A parallel LSTM network is used to decode the sequence descriptions.
arXiv Detail & Related papers (2020-12-02T16:07:32Z)
- Local Context Attention for Salient Object Segmentation [5.542044768017415]
We propose a novel Local Context Attention Network (LCANet) to generate locally reinforced feature maps in a uniform representational architecture.
The proposed network introduces an Attentional Correlation Filter (ACF) module to generate explicit local attention by calculating the correlation feature map between coarse prediction and global context.
Comprehensive experiments are conducted on several salient object segmentation datasets, demonstrating the superior performance of the proposed LCANet against the state-of-the-art methods.
arXiv Detail & Related papers (2020-09-24T09:20:06Z)
- EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection [60.097873683615695]
We aim to address two critical issues in the 3D detection task, including the exploitation of multiple sensors.
We propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner without any image annotations.
We design an end-to-end learnable framework named EPNet to integrate these two components.
arXiv Detail & Related papers (2020-07-17T09:33:05Z)