A-CAP: Anticipation Captioning with Commonsense Knowledge
- URL: http://arxiv.org/abs/2304.06602v1
- Date: Thu, 13 Apr 2023 15:10:47 GMT
- Title: A-CAP: Anticipation Captioning with Commonsense Knowledge
- Authors: Duc Minh Vo, Quoc-An Luong, Akihiro Sugimoto, Hideki Nakayama
- Abstract summary: We propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption.
A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning.
- Score: 28.572510380683724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans possess the capacity to reason about the future based on a sparse
collection of visual cues acquired over time. In order to emulate this ability,
we introduce a novel task called Anticipation Captioning, which generates a
caption for an unseen oracle image using a sparsely temporally-ordered set of
images. To tackle this new task, we propose a model called A-CAP, which
incorporates commonsense knowledge into a pre-trained vision-language model,
allowing it to anticipate the caption. Through both qualitative and
quantitative evaluations on a customized visual storytelling dataset, A-CAP
outperforms other image captioning methods and establishes a strong baseline
for anticipation captioning. We also address the challenges inherent in this
task.
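To make the task setup concrete, here is a minimal, purely illustrative sketch of the anticipation-captioning pipeline suggested by the abstract: a sparsely, temporally ordered set of observed images, commonsense facts attached to the detected concepts, and a caption predicted for the unseen oracle image. The concept detector, knowledge lookup, and generation step below are hypothetical stubs, not A-CAP's actual architecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    image_path: str   # one of the sparsely, temporally ordered input images
    timestamp: int    # relative position in the sequence


def detect_concepts(obs: Observation) -> List[str]:
    """Hypothetical concept detector (objects/actions visible in one image)."""
    return ["child", "flour", "kitchen"]  # placeholder output


def commonsense_facts(concepts: List[str]) -> List[str]:
    """Hypothetical lookup in a commonsense knowledge base (ConceptNet-style)."""
    return ["flour is used for baking", "baking produces a cake"]  # placeholder


def anticipate_caption(observations: List[Observation]) -> str:
    """Sketch only: produce a caption for the *unseen* future (oracle) image."""
    ordered = sorted(observations, key=lambda o: o.timestamp)
    concepts = [c for o in ordered for c in detect_concepts(o)]
    facts = commonsense_facts(concepts)
    # A real model such as A-CAP conditions a pre-trained vision-language
    # model on the images plus the retrieved knowledge; here we simply join
    # the facts into a stand-in "anticipated" caption.
    return "Anticipated: " + "; ".join(facts)


if __name__ == "__main__":
    frames = [Observation("t0.jpg", 0), Observation("t1.jpg", 1), Observation("t2.jpg", 2)]
    print(anticipate_caption(frames))
```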
Related papers
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
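A minimal sketch of the fusion idea summarized in the entry above, under the assumption that fusion is done by prompting a language model with the candidate captions; the prompt wording, the `llm_generate` callback, and the toy stand-in LLM are illustrative, not the paper's pipeline.

```python
from typing import Callable, List


def build_fusion_prompt(candidates: List[str]) -> str:
    """Pack captions produced by different captioning models into one prompt."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return ("Fuse the following candidate captions of the same image into one "
            "accurate, descriptive caption:\n" + numbered)


def fuse_captions(candidates: List[str], llm_generate: Callable[[str], str]) -> str:
    """`llm_generate` stands in for any text-generation backend."""
    return llm_generate(build_fusion_prompt(candidates))


if __name__ == "__main__":
    captions = ["a dog on grass", "a brown dog running across a grassy field"]
    # Trivial stand-in "LLM" that just returns the longest candidate.
    print(fuse_captions(captions, llm_generate=lambda prompt: max(captions, key=len)))
```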
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
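The retrieval step described in the entry above can be pictured as a k-nearest-neighbour lookup over an external memory of (image feature, caption) pairs ranked by visual similarity. The sketch below shows only that lookup; the kNN-augmented attention layer that consumes the retrieved items is omitted, and the features and memory contents are invented.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_knn(query_feature: Sequence[float],
                 memory: List[Tuple[Sequence[float], str]],
                 k: int = 3) -> List[str]:
    """Return captions attached to the k most visually similar memory items."""
    ranked = sorted(memory, key=lambda item: cosine(query_feature, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


if __name__ == "__main__":
    # Tiny hand-made "external corpus" of (image feature, caption) pairs.
    memory = [
        ([0.9, 0.1], "a dog playing fetch in a park"),
        ([0.1, 0.9], "a plate of pasta on a table"),
        ([0.8, 0.3], "two dogs running on grass"),
    ]
    print(retrieve_knn([0.85, 0.2], memory, k=2))
```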
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict semantic concepts and then incorporate them into end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
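A toy illustration of the concept-prediction step mentioned in the entry above: a linear head scores each entry of a concept vocabulary as a multi-label classification, and the top-scoring concepts are kept as extra tokens for the caption decoder. The vocabulary, feature, and weights are made up; the real CTN is built on a vision transformer and trained end to end with the captioner.

```python
import math
from typing import List, Sequence

CONCEPT_VOCAB = ["dog", "frisbee", "grass", "person", "car"]  # toy concept vocabulary


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_concept_tokens(pooled_feature: Sequence[float],
                           weights: List[Sequence[float]],
                           top_k: int = 2) -> List[str]:
    """Stand-in for a concept head: one sigmoid logit per concept (multi-label
    classification); the top-k concepts become extra tokens for the decoder."""
    scores = [sigmoid(sum(w * f for w, f in zip(row, pooled_feature))) for row in weights]
    ranked = sorted(range(len(CONCEPT_VOCAB)), key=lambda i: scores[i], reverse=True)
    return [CONCEPT_VOCAB[i] for i in ranked[:top_k]]


if __name__ == "__main__":
    feature = [0.2, 0.7, 0.1]            # pretend pooled grid feature
    weights = [[1.0, 0.5, 0.0],          # one made-up weight row per concept
               [0.9, 0.8, 0.1],
               [0.1, 0.2, 0.9],
               [0.0, 0.1, 0.2],
               [0.3, 0.0, 0.1]]
    print(predict_concept_tokens(feature, weights))
```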
- Learning to Select: A Fully Attentive Approach for Novel Object Captioning [48.497478154384105]
Novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase.
We present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set.
Our architecture is fully attentive and end-to-end trainable, even when incorporating constraints.
arXiv Detail & Related papers (2021-06-02T19:11:21Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
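A minimal sketch of the caption-guided relationship graph mentioned in the entry above: object mentions become nodes and predicates become edges. The triple extractor here is a placeholder; the paper grounds such relations with weakly supervised multi-instance learning, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RelationshipGraph:
    nodes: List[str] = field(default_factory=list)                    # object mentions
    edges: List[Tuple[str, str, str]] = field(default_factory=list)   # (subject, predicate, object)


def extract_triples(caption: str) -> List[Tuple[str, str, str]]:
    """Placeholder caption parser; a real system would ground these relations
    in image regions via weakly supervised multi-instance learning."""
    return [("man", "riding", "horse"), ("horse", "on", "beach")]  # canned output


def build_caption_guided_graph(caption: str) -> RelationshipGraph:
    graph = RelationshipGraph()
    for subj, pred, obj in extract_triples(caption):
        for node in (subj, obj):
            if node not in graph.nodes:
                graph.nodes.append(node)
        graph.edges.append((subj, pred, obj))
    return graph


if __name__ == "__main__":
    g = build_caption_guided_graph("a man riding a horse on the beach")
    print(g.nodes)   # ['man', 'horse', 'beach']
    print(g.edges)   # [('man', 'riding', 'horse'), ('horse', 'on', 'beach')]
```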
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)
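The summary above names an object-based Semantic Fidelity (SF) metric but does not define it, so the toy score below (the share of annotated object labels that appear in a generated caption) is only a stand-in illustrating what an object-based fidelity measure could look like, not the metric from the paper.

```python
from typing import Iterable


def toy_object_fidelity(caption: str, object_labels: Iterable[str]) -> float:
    """Illustrative stand-in: fraction of object labels mentioned in the caption.
    NOT the Semantic Fidelity (SF) metric defined in the Egoshots paper."""
    labels = [lbl.lower() for lbl in object_labels]
    if not labels:
        return 0.0
    caption_lc = caption.lower()
    mentioned = sum(1 for lbl in labels if lbl in caption_lc)
    return mentioned / len(labels)


if __name__ == "__main__":
    print(toy_object_fidelity("a laptop and a coffee cup on a desk",
                              ["laptop", "cup", "keyboard"]))  # 0.666...
```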