Prompt-based Learning for Unpaired Image Captioning
- URL: http://arxiv.org/abs/2205.13125v1
- Date: Thu, 26 May 2022 03:13:43 GMT
- Title: Prompt-based Learning for Unpaired Image Captioning
- Authors: Peipei Zhu, Xiao Wang, Lin Zhu, Zhenglong Sun, Weishi Zheng, Yaowei
Wang, Changwen Chen
- Abstract summary: Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
- Score: 86.44188293709307
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unpaired Image Captioning (UIC) has been developed to learn image
descriptions from unaligned vision-language sample pairs. Existing schemes
usually adopt the visual concept reward of reinforcement learning to obtain the
alignment between visual concepts and images. However, the cross-domain
alignment is usually weak, which severely constrains the overall performance of
these existing schemes. Recent successes of Vision-Language Pre-Trained Models
(VL-PTMs) have triggered the development of prompt-based learning from VL-PTMs.
In this paper, we present a novel prompt-based scheme to train the UIC model,
making the best use of the powerful generalization ability and the abundant
vision-language prior knowledge learned by VL-PTMs. We adopt the CLIP model
for this research on unpaired image captioning. Specifically, the visual images
are taken as input to the prompt generation module, which contains the
pre-trained model as well as one feed-forward layer for prompt extraction.
Then, the input images and generated prompts are aggregated for unpaired
adversarial captioning learning. To further enhance captioning performance,
we design a high-quality pseudo-caption filter guided by CLIP logits to measure
the correlation between predicted captions and the corresponding images. This
allows us to improve the captioning model in a
supervised learning manner. Extensive experiments on the COCO and Flickr30K
datasets have been carried out to validate the superiority of the proposed
model. We achieve state-of-the-art performance on the COCO dataset,
outperforming the best UIC model by 1.9% on the BLEU-4 metric. We expect
that the proposed prompt-based UIC model will inspire a new line of research
on VL-PTM-based captioning.
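The abstract describes two concrete components: a prompt generation module that maps CLIP image features to prompt vectors through a single feed-forward layer, and a pseudo-caption filter that keeps only captions whose CLIP image-text logit is high enough to serve as pseudo supervision. Below is a minimal sketch of how these pieces might look, assuming PyTorch and the openai `clip` package; the module names, prompt dimension, and logit threshold are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code): a CLIP-based prompt extractor and a
# pseudo-caption filter based on CLIP image-text logits. Names, sizes, and the
# threshold value are illustrative assumptions.
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)


class PromptGenerator(nn.Module):
    """Frozen CLIP image encoder followed by one feed-forward layer that maps
    image features to prompt vectors for the captioner (hypothetical sizes)."""

    def __init__(self, clip_model, prompt_dim=512):
        super().__init__()
        self.clip_model = clip_model
        for p in self.clip_model.parameters():
            p.requires_grad = False  # keep the VL-PTM frozen
        self.ffn = nn.Linear(clip_model.visual.output_dim, prompt_dim)

    def forward(self, images):
        with torch.no_grad():
            feats = self.clip_model.encode_image(images).float()
        return self.ffn(feats)  # prompt vectors fed to the caption generator


@torch.no_grad()
def filter_pseudo_captions(images, captions, threshold=25.0):
    """Keep (image, caption) pairs whose CLIP image-text logit exceeds an
    assumed threshold, so they can be reused as pseudo ground truth."""
    image_feats = clip_model.encode_image(images)
    text_feats = clip_model.encode_text(clip.tokenize(captions).to(images.device))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = clip_model.logit_scale.exp() * (image_feats * text_feats).sum(-1)
    keep = logits > threshold
    return [c for c, k in zip(captions, keep.tolist()) if k]
```

Captions that survive such a filter could then be paired with their images and used to fine-tune the captioner with a standard supervised cross-entropy objective, which is the role the abstract assigns to the pseudo-caption filter.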
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose a straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z)
- Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates improved performance over the state of the art across various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach employs captions generated by a captioning model as the context for an answer prediction model, which is a pre-trained language model (PLM).
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.