Learning to Select: A Fully Attentive Approach for Novel Object
Captioning
- URL: http://arxiv.org/abs/2106.01424v1
- Date: Wed, 2 Jun 2021 19:11:21 GMT
- Title: Learning to Select: A Fully Attentive Approach for Novel Object
Captioning
- Authors: Marco Cagrandi, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi,
Rita Cucchiara
- Abstract summary: Novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase.
We present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set.
Our architecture is fully-attentive and end-to-end trainable, even when incorporating constraints.
- Score: 48.497478154384105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning models have lately shown impressive results when applied to
standard datasets. Switching to real-life scenarios, however, constitutes a
challenge due to the larger variety of visual concepts which are not covered in
existing training sets. For this reason, novel object captioning (NOC) has
recently emerged as a paradigm to test captioning models on objects which are
unseen during the training phase. In this paper, we present a novel approach
for NOC that learns to select the most relevant objects of an image, regardless
of their adherence to the training set, and to constrain the generative process
of a language model accordingly. Our architecture is fully-attentive and
end-to-end trainable, even when incorporating constraints. We perform
experiments on the held-out COCO dataset, where we demonstrate improvements
over the state of the art, both in terms of adaptability to novel objects and
caption quality.
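The paper itself does not include code here, but the core idea — select the relevant object tags for an image and force them into the caption by constraining the decoder — can be sketched in a toy way. The Python snippet below is a minimal, hypothetical illustration rather than the authors' fully-attentive model: `VOCAB`, `toy_lm_scores`, and the `boost` parameter are invented for the example, and a simple greedy search stands in for the constrained generation described in the paper.

```python
# Minimal, hypothetical sketch of constraint-aware greedy decoding for NOC.
# A toy scorer stands in for the trained language model; the paper's actual
# model is a fully-attentive Transformer that learns object selection and
# constrained generation end-to-end.

import math

VOCAB = ["a", "person", "riding", "zebra", "horse", "on", "grass", "<eos>"]

def toy_lm_scores(prefix):
    """Return unnormalized log-scores over VOCAB given the caption prefix.
    This is a stand-in: a real system would query the trained captioner."""
    scores = {w: -1.0 for w in VOCAB}
    # Bias the toy model toward a plausible in-domain caption.
    preferred = ["a", "person", "riding", "a", "horse", "on", "grass", "<eos>"]
    step = len(prefix)
    if step < len(preferred):
        scores[preferred[step]] = 0.0
    return scores

def constrained_greedy_decode(constraints, max_len=10, boost=5.0):
    """Greedy decoding that boosts unmet constraint tokens (e.g. novel objects
    picked by a detector) so they are guaranteed to appear in the caption."""
    remaining = set(constraints)
    caption = []
    for _ in range(max_len):
        scores = toy_lm_scores(caption)
        for token in remaining:
            scores[token] = scores.get(token, -math.inf) + boost
        # Do not allow the caption to end while constraints are still unmet.
        if remaining:
            scores["<eos>"] = -math.inf
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        caption.append(next_token)
        remaining.discard(next_token)
    return " ".join(caption)

if __name__ == "__main__":
    # "zebra" plays the role of a novel object unseen at training time.
    print(constrained_greedy_decode(constraints=["zebra"]))
```

Running the script produces a caption that necessarily mentions the novel object ("zebra" in this toy vocabulary), which is the behaviour the constrained generation in the paper is designed to guarantee; the actual model learns which objects to select and how to place them jointly, rather than relying on a hand-set boost.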
Related papers
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- CaMEL: Mean Teacher Learning for Image Captioning [47.9708610052655]
We present CaMEL, a novel Transformer-based architecture for image captioning.
Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase.
Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors.
arXiv Detail & Related papers (2022-02-21T19:04:46Z)
- Partially-supervised novel object captioning leveraging context from paired data [11.215352918313577]
We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs.
We further re-use these partially paired images with novel objects to create pseudo-label captions.
Our approach achieves state-of-the-art results on the held-out MS COCO out-of-domain test split.
arXiv Detail & Related papers (2021-09-10T21:31:42Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)