Compact Bidirectional Transformer for Image Captioning
- URL: http://arxiv.org/abs/2201.01984v1
- Date: Thu, 6 Jan 2022 09:23:18 GMT
- Title: Compact Bidirectional Transformer for Image Captioning
- Authors: Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Meng Wang
- Abstract summary: We introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly.
We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact architecture serves as a regularization for implicitly exploiting bidirectional context.
We achieve new state-of-the-art results in comparison with non-vision-language-pretraining models.
- Score: 15.773455578749118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most current image captioning models generate captions from left to right. This unidirectional property means they can only leverage past context, not future context. Although recent refinement-based models can exploit both past and future context by generating a new caption in a second stage conditioned on captions pre-retrieved or pre-generated in a first stage, the decoder of these models generally consists of two networks (i.e., a retriever or captioner in the first stage and a refiner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context both implicitly and explicitly, while its decoder can be executed in parallel. Specifically, the model tightly couples the left-to-right (L2R) and right-to-left (R2L) flows into a single compact model (implicitly) and optionally allows interaction between the two flows (explicitly), while the final caption is chosen from either the L2R or the R2L flow by a sentence-level ensemble. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact architecture, which acts as a regularizer for implicitly exploiting bidirectional context, and the sentence-level ensemble play more important roles than the explicit interaction mechanism. Combining it seamlessly with a word-level ensemble further enlarges the effect of the sentence-level ensemble. We further extend conventional one-flow self-critical training to a two-flow version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Source code is available at https://github.com/YuanEZhou/CBTrans.
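To make the core idea concrete, below is a minimal PyTorch sketch, not the authors' released code: all class names, hyper-parameters, and the decoding interface are hypothetical assumptions. It illustrates how one compact decoder can serve both the L2R and the R2L flow by sharing weights and distinguishing the flows with a direction embedding, and how a sentence-level ensemble can pick the final caption as the flow with the higher sequence log-probability. See the repository linked above for the actual implementation.
```python
# Minimal sketch (assumed names/shapes, not the official CBTrans code):
# one shared decoder handles both the L2R and the reversed R2L caption,
# and the final caption is picked per image by sentence-level ensemble.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompactBiDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Direction embedding tells the shared weights which flow a sequence
        # belongs to: 0 = left-to-right, 1 = right-to-left.
        self.dir_emb = nn.Embedding(2, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, direction, img_feats):
        # tokens:    (B, T) caption tokens, already reversed for the R2L flow
        # direction: (B,)   0 = L2R, 1 = R2L
        # img_feats: (B, N, d_model) encoded image region/grid features
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        x = self.tok_emb(tokens) + self.dir_emb(direction).unsqueeze(1)
        h = self.decoder(x, img_feats, tgt_mask=causal)
        return self.lm_head(h)  # (B, T, vocab)


def sentence_level_ensemble(model, img_feats, l2r_tokens, r2l_tokens):
    """Score an L2R caption and an R2L caption for the same image and keep
    whichever has the higher sum of token log-probabilities.
    (Padding/BOS/EOS handling is omitted for brevity.)"""
    def seq_logprob(tokens, direction_id):
        direction = torch.full((tokens.size(0),), direction_id,
                               dtype=torch.long, device=tokens.device)
        logits = model(tokens[:, :-1], direction, img_feats)
        logp = F.log_softmax(logits, dim=-1)
        tgt = tokens[:, 1:]
        return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(dim=1)

    l2r_score = seq_logprob(l2r_tokens, 0)
    r2l_score = seq_logprob(r2l_tokens, 1)
    pick_l2r = l2r_score >= r2l_score
    # The winning R2L caption is flipped back to reading order.
    return [l2r_tokens[i] if pick_l2r[i] else torch.flip(r2l_tokens[i], [0])
            for i in range(l2r_tokens.size(0))]
```
In this sketch the two flows share every parameter, which is what makes the model "compact"; running both flows is just a matter of batching the original and reversed sequences through the same decoder.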
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net).
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800x faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- L-Verse: Bidirectional Generation Between Image and Text [41.133824156046394]
L-Verse is a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART).
Our AugVAE shows state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild.
L-Verse can be directly used for image-to-text or text-to-image generation tasks without any finetuning or extra object detection frameworks.
arXiv Detail & Related papers (2021-11-22T11:48:26Z)
- ReFormer: The Relational Transformer for Image Captioning [12.184772369145014]
Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in the image.
We propose a novel architecture, ReFormer, that generates features with relation information embedded.
Our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
arXiv Detail & Related papers (2021-07-29T17:03:36Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images that are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels [35.57369098866317]
Vision-language pre-training on large-scale image-text pairs has witnessed rapid progress for learning cross-modal representations.
We propose a new pre-training method which jointly aligns both the low-level and high-level semantics between image and text representations.
arXiv Detail & Related papers (2021-03-14T02:39:14Z)