Image Captioning through Image Transformer
- URL: http://arxiv.org/abs/2004.14231v2
- Date: Fri, 2 Oct 2020 19:26:14 GMT
- Title: Image Captioning through Image Transformer
- Authors: Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn,
Nicolas Pugeault
- Abstract summary: We introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer.
Our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
- Score: 29.91581534937757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic captioning of images is a task that combines the challenges of
image analysis and text generation. One important aspect in captioning is the
notion of attention: How to decide what to describe and in which order.
Inspired by its successes in text analysis and translation, previous work has
proposed the \textit{transformer} architecture for image captioning. However,
the structure of the \textit{semantic units} in images (usually regions
detected by an object detection model) differs from that of sentences
(individual words). Limited work has been done to adapt the transformer's internal
architecture to images. In this work, we introduce the \textbf{\textit{image
transformer}}, which consists of a modified encoding transformer and an
implicit decoding transformer, motivated by the relative spatial relationship
between image regions. Our design widens the original transformer layer's inner
architecture to adapt to the structure of images. With only region features as
inputs, our model achieves new state-of-the-art performance on both MSCOCO
offline and online testing benchmarks.
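The abstract describes the widened encoder layer only at a high level, so the following PyTorch snippet is an illustrative sketch rather than the authors' implementation: it shows one possible "widened" layer over detector region features, where several parallel self-attention branches are each restricted to a different spatial relation between regions. The number of branches, the mask construction, and all names are assumptions.

```python
import torch
import torch.nn as nn

class WidenedEncoderLayer(nn.Module):
    """One encoder layer widened into parallel self-attention branches,
    each restricted to a different spatial relation between regions."""

    def __init__(self, d_model=512, n_heads=8, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_branches)
        )
        self.fuse = nn.Linear(n_branches * d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, regions, relation_masks):
        # regions: (batch, n_regions, d_model) features from an object detector
        # relation_masks: one (n_regions, n_regions) bool mask per branch;
        # True blocks attention, so each branch only sees one spatial relation
        # (e.g. neighbouring vs. containing vs. contained regions)
        outs = [attn(regions, regions, regions, attn_mask=mask)[0]
                for attn, mask in zip(self.branches, relation_masks)]
        x = self.norm1(regions + self.fuse(torch.cat(outs, dim=-1)))
        return self.norm2(x + self.ffn(x))
```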
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
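HAT's hierarchical alignment module is not detailed in this summary; as rough orientation only, a two-stream retrieval model of this kind can be sketched as two Transformer encoders whose projected, normalised embeddings are compared via a similarity matrix. The simple global cosine similarity below is a stand-in, not HAT's actual alignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamRetriever(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 d_img=768, d_txt=768, d_joint=256):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT returning (batch, d_img)
        self.text_encoder = text_encoder     # e.g. a text Transformer returning (batch, d_txt)
        self.img_proj = nn.Linear(d_img, d_joint)
        self.txt_proj = nn.Linear(d_txt, d_joint)

    def forward(self, images, texts):
        img = F.normalize(self.img_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.txt_proj(self.text_encoder(texts)), dim=-1)
        # (n_images, n_texts) cosine-similarity matrix used to rank retrieval candidates
        return img @ txt.t()
```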
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Language Guided Local Infiltration for Interactive Image Retrieval [12.324893780690918]
Interactive Image Retrieval (IIR) aims to retrieve images that are generally similar to the reference image but under requested text modification.
We propose a Language Guided Local Infiltration (LGLI) system, which fully utilizes the text information and penetrates text features into image features.
Our method outperforms most state-of-the-art IIR approaches.
arXiv Detail & Related papers (2023-04-16T10:33:08Z)
- Neighborhood Contrastive Transformer for Change Captioning [80.10836469177185]
We propose a neighborhood contrastive transformer to improve the model's perceiving ability for various changes under different scenes.
The proposed method achieves the state-of-the-art performance on three public datasets with different change scenarios.
arXiv Detail & Related papers (2023-03-06T14:39:54Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements: (1) edit the entity consistently with the text description, (2) preserve the text-irrelevant regions, and (3) merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the vision and language modalities.
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
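As described above, the editing loop amounts to optimising the image toward a joint image-text target point in a CLIP-like multimodal space. The sketch below is a heavily simplified conceptual illustration: the target construction, the single crude pixel-space penalty (standing in for FlexIT's regularization terms), and the `encode_image` callable are all assumptions.

```python
import torch
import torch.nn.functional as F

def edit_image(image, edit_text_emb, encode_image, steps=200, lr=0.05, lam=10.0):
    """image: (1, 3, H, W) tensor; edit_text_emb: normalised text embedding;
    encode_image: differentiable callable returning an image embedding."""
    with torch.no_grad():
        # combine the input image and the edit text into one target point (assumed recipe)
        target = F.normalize(F.normalize(encode_image(image), dim=-1) + edit_text_emb, dim=-1)
    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        emb = F.normalize(encode_image(x), dim=-1)
        loss = 1.0 - (emb * target).sum(-1).mean()      # pull the image toward the target point
        loss = loss + lam * (x - image).pow(2).mean()   # crude stand-in regulariser: stay near the input
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```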
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
- Embedding Arithmetic for Text-driven Image Transformation [48.7704684871689]
Text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man.
Recent works aiming to bridge this semantic gap embed images and text into a multimodal space.
We introduce the SIMAT dataset to evaluate the task of text-driven image transformation.
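A minimal sketch of the kind of embedding arithmetic such a task evaluates, assuming image and word embeddings already live in a shared, L2-normalised multimodal space; the helper names and the retrieval-by-nearest-neighbour step are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def transform_query(img_emb, src_word_emb, tgt_word_emb, scale=1.0):
    # e.g. an image of "a cat on the grass" with src="cat", tgt="dog":
    # shift the image embedding along the text direction (cat -> dog)
    query = img_emb + scale * (tgt_word_emb - src_word_emb)
    return F.normalize(query, dim=-1)

def retrieve_nearest(query, gallery_embs):
    # gallery_embs: (n_images, d) normalised embeddings; returns index of the best match
    return (gallery_embs @ query).argmax().item()
```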
arXiv Detail & Related papers (2021-12-06T16:51:50Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.