Partially-supervised novel object captioning leveraging context from paired data
- URL: http://arxiv.org/abs/2109.05115v1
- Date: Fri, 10 Sep 2021 21:31:42 GMT
- Title: Partially-supervised novel object captioning leveraging context from paired data
- Authors: Shashank Bujimalla, Mahesh Subedar, Omesh Tickoo
- Abstract summary: We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs.
We further re-use these partially paired images with novel objects to create pseudo-label captions.
Our approach achieves state-of-the-art results on the held-out MS COCO out-of-domain test split.
- Score: 11.215352918313577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an approach to improve image captioning solutions
for images with novel objects that do not have caption labels in the training
dataset. Our approach is agnostic to model architecture, and primarily focuses
on a training technique that uses existing fully paired image-caption data and
images that have only novel object detection labels (partially paired data).
We create synthetic paired captioning data for these novel objects by
leveraging context from existing image-caption pairs. We further re-use these
partially paired images with novel objects to create pseudo-label captions that
are used to fine-tune the captioning model. Using a popular captioning model
(Up-Down) as the baseline, our approach achieves state-of-the-art results on the
held-out MS COCO out-of-domain test split, and improves the F1 metric and CIDEr for
novel object images by 75.8 and 26.6 points, respectively, compared to a baseline
model that does not use partially paired images during training.
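The abstract describes the approach only at a high level. Below is a minimal, illustrative sketch of the two ideas it mentions: building a synthetic caption for a novel object by reusing the context of an existing caption of a similar known object, and keeping model-generated captions as pseudo-labels only when they mention the detected novel object. All names and the simple word-swap heuristic are assumptions for illustration; the paper's actual procedure may differ.

```python
# Illustrative sketch only; not the authors' code. The word-swap heuristic and
# the mention-based pseudo-label filter are assumptions made for this example.

def make_synthetic_caption(existing_caption: str, known_obj: str, novel_obj: str) -> str:
    """Swap a known object word for the novel object, reusing the rest of the
    caption as context (e.g. 'a dog lying on the grass' -> 'a zebra lying on the grass')."""
    return existing_caption.replace(known_obj, novel_obj)

def filter_pseudo_labels(generated_captions, novel_obj):
    """Keep only generated captions that actually mention the detected novel
    object; these serve as pseudo-label captions for fine-tuning."""
    return [c for c in generated_captions if novel_obj in c.split()]

if __name__ == "__main__":
    print(make_synthetic_caption("a dog lying on the grass", "dog", "zebra"))
    print(filter_pseudo_labels(
        ["a zebra standing in a field", "an animal standing in a field"], "zebra"))
```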
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- The Solution for the CVPR2023 NICE Image Captioning Challenge [11.37047794237074]
We present our solution to the New frontiers for Zero-shot Image Captioning Challenge.
This challenge covers a new and larger variety of visual concepts from many domains.
At the data level, we collect external training data from Laion-5B.
At the model level, we use OFA, a large-scale visual-language pre-trained model.
arXiv Detail & Related papers (2023-10-10T09:09:41Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
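As a rough illustration of what CLIP-based sentence retrieval looks like in practice (not the paper's actual pipeline), the sketch below ranks a few placeholder article sentences by their CLIP image-text similarity using the Hugging Face transformers API; the model name, image path, and sentences are assumptions.

```python
# Illustrative CLIP retrieval sketch; not the paper's implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("news_photo.jpg")                       # placeholder image
sentences = ["The president spoke at the summit.",         # placeholder article
             "A crowd gathered outside the stadium."]      # sentences

inputs = processor(text=sentences, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each candidate sentence.
scores = outputs.logits_per_image.squeeze(0)
best = scores.argmax().item()
print(sentences[best], scores.tolist())
```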
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, which places similar images, topics, and captions close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
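For context, a pairwise ranking objective of the kind mentioned above is commonly written as a max-margin loss over matching and mismatched image-caption pairs. The sketch below is a standard formulation, not necessarily the exact loss used in this paper.

```python
# Generic max-margin pairwise ranking loss over a shared image/caption
# embedding space; a standard formulation assumed for illustration only.
import torch

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.2):
    """img_emb, cap_emb: (batch, dim) L2-normalized embeddings where row i of
    each tensor forms a matching image-caption pair."""
    scores = img_emb @ cap_emb.t()                        # cosine similarities
    pos = scores.diag().view(-1, 1)                       # matching-pair scores
    cost_cap = (margin + scores - pos).clamp(min=0)       # rank captions per image
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # rank images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_cap = cost_cap.masked_fill(mask, 0)              # ignore the positives
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()

# Example usage with random normalized embeddings
img = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
cap = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
print(pairwise_ranking_loss(img, cap))
```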
arXiv Detail & Related papers (2022-04-15T14:22:09Z)
- Learning to Select: A Fully Attentive Approach for Novel Object Captioning [48.497478154384105]
Novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during the training phase.
We present a novel approach for NOC that learns to select the most relevant objects of an image, regardless of their adherence to the training set.
Our architecture is fully-attentive and end-to-end trainable, even when incorporating constraints.
arXiv Detail & Related papers (2021-06-02T19:11:21Z)
- Iconographic Image Captioning for Artworks [2.3859169601259342]
This work utilizes a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography.
The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task.
A transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset.
The quality of the generated captions and the model's capacity to generalize to new data are explored by applying the model to a new collection of paintings and analyzing the relation between commonly generated captions and artistic genre.
arXiv Detail & Related papers (2021-02-07T23:11:33Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)