Geometry Attention Transformer with Position-aware LSTMs for Image Captioning
- URL: http://arxiv.org/abs/2110.00335v1
- Date: Fri, 1 Oct 2021 11:57:50 GMT
- Title: Geometry Attention Transformer with Position-aware LSTMs for Image Captioning
- Authors: Chi Wang, Yulin Shen, Luping Ji
- Abstract summary: This paper proposes an improved Geometry Attention Transformer (GAT) model.
In order to further leverage geometric information, two novel geometry-aware architectures are designed.
Our GAT often outperforms current state-of-the-art image captioning models.
- Score: 8.944233327731245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, transformer structures have been widely applied in image
captioning with impressive performance. The geometry and position relations among
different visual objects are often regarded as crucial information for good captioning
results. Aiming to further promote image captioning with transformers, this paper
proposes an improved Geometry Attention Transformer (GAT) model. To better leverage
geometric information, two novel geometry-aware architectures are designed for the
encoder and the decoder of GAT, respectively. In addition, the model includes two
working modules: 1) a geometry gate-controlled self-attention refiner, which explicitly
incorporates relative spatial information into image region representations during
encoding, and 2) a group of position-LSTMs, which precisely inform the decoder of
relative word positions while generating caption text. Experimental comparisons on the
MS COCO and Flickr30K datasets show that GAT is efficient and often outperforms current
state-of-the-art image captioning models.
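To make the two modules above concrete, here is a minimal PyTorch sketch of the ideas they describe: a geometry gate-controlled self-attention step over detected image regions, and a decoder LSTM cell fed an explicit relative word-position signal. This is not the authors' code; the class names (GeometryGatedSelfAttention, PositionAwareLSTM), the log-space relative box features, and the scalar sigmoid gate are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_box_geometry(boxes):
    """Pairwise log-space (dx, dy, dw, dh) features for N boxes given as (x, y, w, h)."""
    x, y, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = torch.log(torch.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)            # (N, N, 4)

class GeometryGatedSelfAttention(nn.Module):
    """Self-attention whose logits mix content similarity with a gated geometry bias."""
    def __init__(self, dim, geo_dim=4):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.geo_bias = nn.Linear(geo_dim, 1)               # scalar bias per region pair
        self.gate = nn.Linear(dim, 1)                        # per-region gate in (0, 1)
        self.scale = dim ** -0.5

    def forward(self, regions, boxes):
        # regions: (N, dim) appearance features; boxes: (N, 4) as (x, y, w, h)
        content = (self.q(regions) @ self.k(regions).t()) * self.scale       # (N, N)
        geometry = self.geo_bias(relative_box_geometry(boxes)).squeeze(-1)   # (N, N)
        g = torch.sigmoid(self.gate(regions))                                # (N, 1)
        attn = F.softmax(content + g * geometry, dim=-1)     # gate controls geometry influence
        return attn @ self.v(regions)                        # refined region features

class PositionAwareLSTM(nn.Module):
    """Decoder cell whose input is augmented with a relative word-position signal."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Linear(1, dim)                         # encodes t / (T - 1)
        self.cell = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        # tokens: (T,) word ids of a (partial) caption
        T = tokens.shape[0]
        h = c = torch.zeros(1, self.cell.hidden_size)
        logits = []
        for t in range(T):
            rel = torch.tensor([[t / max(T - 1, 1)]])        # relative position in [0, 1]
            x = torch.cat([self.embed(tokens[t:t + 1]), self.pos(rel)], dim=-1)
            h, c = self.cell(x, (h, c))
            logits.append(self.out(h))
        return torch.cat(logits, dim=0)                      # (T, vocab_size)

regions, boxes = torch.randn(5, 64), torch.rand(5, 4) + 0.1
refined = GeometryGatedSelfAttention(64)(regions, boxes)     # -> (5, 64)
caption = torch.randint(0, 100, (7,))
scores = PositionAwareLSTM(100, 64)(caption)                 # -> (7, 100)
print(refined.shape, scores.shape)
```

In this sketch, the gate lets each region decide how strongly pairwise geometry should reshape its attention weights, while the position signal gives the decoder an explicit notion of how far along the caption it is.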
Related papers
- GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers [63.41460219156508]
We argue that existing positional encoding schemes are suboptimal for 3D vision tasks.
We propose a geometry-aware attention mechanism that encodes the geometric structure of tokens as relative transformation.
We show that our attention, called Geometric Transform Attention (GTA), improves learning efficiency and performance of state-of-the-art transformer-based NVS models.
arXiv Detail & Related papers (2023-10-16T13:16:09Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Track Targets by Dense Spatio-Temporal Position Encoding [27.06820571703848]
We propose a novel paradigm to encode the position of targets for target tracking in videos using transformers.
The proposed position encoding provides location information to associate targets across frames beyond appearance matching.
Our encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information.
arXiv Detail & Related papers (2022-10-17T22:04:39Z)
- PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution [0.42970700836450487]
We propose a new model architecture, namely PiSLTRc, with two distinctive characteristics.
We explicitly select relevant features using a novel content-aware neighborhood gathering method.
We aggregate these features with position-informed temporal convolution layers, thus generating robust neighborhood-enhanced sign representation.
Compared with the vanilla Transformer model, our model performs consistently better on three large-scale sign language benchmarks.
arXiv Detail & Related papers (2021-07-27T05:01:27Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features (a minimal sketch of this idea appears after this list).
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Dual-Level Collaborative Transformer for Image Captioning [126.59298716978577]
We introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of region-level and grid-level visual features.
In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features.
arXiv Detail & Related papers (2021-01-16T15:43:17Z)
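As a concrete illustration of the learnable Fourier feature positional encoding entry above, the following is a minimal PyTorch sketch; the layer sizes, the learned frequency matrix, and the trailing MLP are illustrative assumptions rather than that paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    """Maps M-dimensional positions to a D-dimensional encoding via learned frequencies."""
    def __init__(self, pos_dim=2, fourier_dim=64, out_dim=128):
        super().__init__()
        # Learned frequency matrix replaces the fixed sinusoid frequencies.
        self.freqs = nn.Linear(pos_dim, fourier_dim // 2, bias=False)
        self.mlp = nn.Sequential(nn.Linear(fourier_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))
        self.scale = 1.0 / math.sqrt(fourier_dim)

    def forward(self, positions):
        # positions: (..., pos_dim), e.g. normalized (x, y) coordinates
        r = self.freqs(positions)
        fourier = self.scale * torch.cat([torch.cos(r), torch.sin(r)], dim=-1)
        return self.mlp(fourier)

coords = torch.rand(5, 2)                      # five 2-D positions
print(LearnableFourierPE()(coords).shape)      # torch.Size([5, 128])
```

Unlike fixed sinusoidal encodings, the frequency matrix here is trained together with the rest of the model, so the encoding can adapt to the coordinate scales of the task.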