Paraphrasing Is All You Need for Novel Object Captioning
- URL: http://arxiv.org/abs/2209.12343v1
- Date: Sun, 25 Sep 2022 22:56:04 GMT
- Title: Paraphrasing Is All You Need for Novel Object Captioning
- Authors: Cheng-Fu Yang, Yao-Hung Hubert Tsai, Wan-Cyuan Fan, Ruslan
Salakhutdinov, Louis-Philippe Morency, Yu-Chiang Frank Wang
- Abstract summary: Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
- Score: 126.66301869607656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Novel object captioning (NOC) aims to describe images containing objects
without observing their ground truth captions during training. Due to the
absence of caption annotation, captioning models cannot be directly optimized
via sequence-to-sequence training or CIDEr optimization. We therefore present
Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that
heuristically optimizes the output captions via paraphrasing.
With P2C, the captioning model first learns paraphrasing from a language model
pre-trained on a text-only corpus, which expands its word bank and improves
linguistic fluency. To further ensure that the output caption sufficiently
describes the visual content of the input image, we perform self-paraphrasing
on the captioning model with fidelity and adequacy objectives. Since no ground
truth captions are available for novel
object images during training, our P2C leverages cross-modality (image-text)
association modules to ensure the above caption characteristics can be properly
preserved. In the experiments, we not only show that our P2C achieves
state-of-the-art performance on the nocaps and COCO Caption datasets, but also
verify the effectiveness and flexibility of our learning framework by replacing
language and cross-modality association models for NOC. Implementation details
and code are available in the supplementary materials.
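Below is a minimal, hedged sketch of how the two caption properties named in the abstract (adequacy to the image and fidelity under paraphrasing) could be scored with an off-the-shelf cross-modality association model. It assumes CLIP via the Hugging Face transformers library and a simple weighted combination with a hypothetical weight alpha; the paper's actual modules and objectives may differ.

```python
# Minimal sketch (not the authors' implementation): scoring the two caption
# properties described in the abstract with an off-the-shelf model.
# Assumption: CLIP serves as the cross-modality (image-text) association module,
# and its text encoder doubles as a proxy check that a paraphrase stays faithful.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def adequacy_score(image: Image.Image, caption: str) -> float:
    """Image-text association: does the caption describe the visual content?"""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())  # cosine similarity


def fidelity_score(caption: str, paraphrase: str) -> float:
    """Text-text association: does the paraphrase preserve the caption's meaning?"""
    inputs = processor(text=[caption, paraphrase], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])


def paraphrase_reward(image: Image.Image, caption: str, paraphrase: str,
                      alpha: float = 0.5) -> float:
    """Hypothetical weighted reward; alpha is illustrative, not from the paper."""
    return alpha * adequacy_score(image, caption) + (1 - alpha) * fidelity_score(caption, paraphrase)
```

In a reinforcement-style self-paraphrasing stage, scores of this kind could serve as sequence-level rewards for sampled captions; the exact losses used in P2C are given in the paper itself.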
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z) - What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game [53.74847926974122]
Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
arXiv Detail & Related papers (2020-11-17T10:19:27Z)
- VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning [128.6138588412508]
This paper presents VIsual VOcabulary pretraining (VIVO) that performs pre-training in the absence of caption annotations.
Our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects.
arXiv Detail & Related papers (2020-09-28T23:20:02Z)