Injecting Semantic Concepts into End-to-End Image Captioning
- URL: http://arxiv.org/abs/2112.05230v1
- Date: Thu, 9 Dec 2021 22:05:05 GMT
- Title: Injecting Semantic Concepts into End-to-End Image Captioning
- Authors: Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan
Wang, Yezhou Yang, Zicheng Liu
- Abstract summary: We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on a vision transformer and is designed to predict the concept tokens through a classification task.
- Score: 61.41154537334627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tremendous progress has been made in recent years in developing better image
captioning models, yet most of them rely on a separate object detector to
extract regional features. Recent vision-language studies are shifting towards
the detector-free trend by leveraging grid representations for more flexible
model training and faster inference speed. However, such development is
primarily focused on image understanding tasks, and remains less investigated
for the caption generation task. In this paper, we are concerned with a
better-performing detector-free image captioning model, and propose a pure
vision transformer-based image captioning model, dubbed ViTCAP, in which
grid representations are used without extracting regional features. For
improved performance, we introduce a novel Concept Token Network (CTN) to
predict the semantic concepts and then incorporate them into the end-to-end
captioning. In particular, the CTN is built on a vision transformer and is
designed to predict the concept tokens through a classification task; the rich
semantic information these tokens carry greatly benefits the captioning task.
Compared with previous detector-based models, ViTCAP drastically simplifies
the architecture while achieving
competitive performance on various challenging image captioning datasets. In
particular, ViTCAP reaches a CIDEr score of 138.1 on the COCO-caption Karpathy
split, and CIDEr scores of 93.8 and 108.6 on the nocaps and Google-CC
captioning datasets, respectively.
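As a rough illustration of how a detector-free, concept-aware captioner of this kind might be wired together, the sketch below pairs a ViT grid encoder with a concept classification head whose top-scoring concepts are embedded and concatenated with the grid features before autoregressive decoding. All class and argument names (ConceptTokenNetwork, ViTCapSketch, concept_vocab_size, top_k), as well as the pooling, decoder depth, and top-k selection, are assumptions for illustration and not details of the authors' released model.

```python
# Minimal sketch of a ViTCAP-style, detector-free captioner with a concept token head.
# All names and hyperparameters here are hypothetical, not the authors' released code.
import torch
import torch.nn as nn


class ConceptTokenNetwork(nn.Module):
    """Predicts semantic concepts from ViT grid tokens via multi-label classification."""

    def __init__(self, hidden_dim: int, concept_vocab_size: int, top_k: int = 20):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, concept_vocab_size)
        self.concept_embed = nn.Embedding(concept_vocab_size, hidden_dim)
        self.top_k = top_k

    def forward(self, grid_tokens: torch.Tensor):
        # grid_tokens: (batch, num_patches, hidden_dim)
        pooled = grid_tokens.mean(dim=1)              # simple global pooling (assumption)
        concept_logits = self.classifier(pooled)      # (batch, concept_vocab_size)
        top_ids = concept_logits.topk(self.top_k, dim=-1).indices
        concept_tokens = self.concept_embed(top_ids)  # (batch, top_k, hidden_dim)
        return concept_logits, concept_tokens


class ViTCapSketch(nn.Module):
    """Grid features plus predicted concept tokens feed an autoregressive decoder."""

    def __init__(self, vit_encoder: nn.Module, hidden_dim: int,
                 concept_vocab_size: int, caption_vocab_size: int):
        super().__init__()
        self.encoder = vit_encoder  # any ViT returning (batch, num_patches, hidden_dim)
        self.ctn = ConceptTokenNetwork(hidden_dim, concept_vocab_size)
        layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.word_embed = nn.Embedding(caption_vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, caption_vocab_size)

    def forward(self, images: torch.Tensor, caption_ids: torch.Tensor):
        grid_tokens = self.encoder(images)
        concept_logits, concept_tokens = self.ctn(grid_tokens)
        # Concept tokens are concatenated with grid features as decoder memory.
        memory = torch.cat([grid_tokens, concept_tokens], dim=1)
        tgt = self.word_embed(caption_ids)
        seq_len = caption_ids.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=caption_ids.device),
            diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden), concept_logits   # caption logits, concept logits
```

In such a setup, the concept head would typically be supervised with a multi-label classification loss (e.g., binary cross-entropy against image-level tags or concepts mined from reference captions), matching the abstract's description of predicting concept tokens through a classification task, while the decoder is trained with the usual token-level cross-entropy.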
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose a straightforward solution: leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z) - Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions (a minimal retrieval sketch follows this list).
arXiv Detail & Related papers (2024-05-21T18:02:07Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions containing semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)