Controllable Image Captioning via Prompting
- URL: http://arxiv.org/abs/2212.01803v1
- Date: Sun, 4 Dec 2022 11:59:31 GMT
- Title: Controllable Image Captioning via Prompting
- Authors: Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li
- Abstract summary: We show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles.
To be specific, we design a set of prompts to fine-tune the pre-trained image captioner.
In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts.
- Score: 9.935191668056463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the remarkable progress of image captioning, existing captioners
typically lack the controllable capability to generate desired image captions,
e.g., describing the image in a rough or detailed manner, in a factual or
emotional view, etc. In this paper, we show that a unified model is qualified
to perform well in diverse domains and freely switch among multiple styles.
Such a controllable capability is achieved by embedding the prompt learning
into the image captioning framework. To be specific, we design a set of prompts
to fine-tune the pre-trained image captioner. These prompts allow the model to
absorb stylized data from different domains for joint training, without
performance degradation in each domain. Furthermore, we optimize the prompts
with learnable vectors in the continuous word embedding space, avoiding the
heuristic prompt engineering and meanwhile exhibiting superior performance. In
the inference stage, our model is able to generate desired stylized captions by
choosing the corresponding prompts. Extensive experiments verify the
controllable capability of the proposed method. Notably, we achieve outstanding
performance on two diverse image captioning benchmarks including COCO Karpathy
split and TextCaps using a unified model.
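The controllability mechanism described in the abstract amounts to soft prompt tuning: a short sequence of learnable vectors in the continuous word-embedding space is kept per style and prepended to the caption decoder's input, and the desired style is selected at inference simply by choosing which prompt to prepend. A minimal sketch of this idea follows; it is not the authors' released code, and the toy TransformerDecoder, the class names (StylePromptPool, PromptedCaptioner), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of style-specific soft prompts for a caption decoder
# (not the authors' released code). The tiny TransformerDecoder stands in
# for a pre-trained captioner; names and hyperparameters are assumptions.
# Training loop and causal mask are omitted for brevity.
import torch
import torch.nn as nn


class StylePromptPool(nn.Module):
    """One learnable prompt (a few vectors in the continuous word-embedding
    space) per caption style, e.g. factual / emotional / detailed."""

    def __init__(self, num_styles: int, prompt_len: int, embed_dim: int):
        super().__init__()
        self.prompts = nn.Parameter(
            torch.randn(num_styles, prompt_len, embed_dim) * 0.02
        )

    def forward(self, style_id: torch.Tensor) -> torch.Tensor:
        # style_id: (batch,) integer indices selecting the desired style.
        return self.prompts[style_id]  # (batch, prompt_len, embed_dim)


class PromptedCaptioner(nn.Module):
    """Prepends the selected style prompt to the caption token embeddings
    before the decoder, which cross-attends to image features."""

    def __init__(self, vocab_size=1000, embed_dim=256, num_styles=3, prompt_len=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.prompt_pool = StylePromptPool(num_styles, prompt_len, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, caption_ids, image_feats, style_id):
        prompt = self.prompt_pool(style_id)            # (B, P, D)
        tokens = self.token_embed(caption_ids)         # (B, T, D)
        x = torch.cat([prompt, tokens], dim=1)         # prompt-conditioned input
        x = self.decoder(tgt=x, memory=image_feats)    # cross-attend to the image
        return self.lm_head(x[:, prompt.size(1):])     # logits for caption tokens only


# Toy usage: pick style index 1 ("emotional", say) at inference time.
model = PromptedCaptioner()
caption_ids = torch.randint(0, 1000, (2, 12))
image_feats = torch.randn(2, 49, 256)                  # e.g. 7x7 grid features
style_id = torch.tensor([1, 1])
print(model(caption_ids, image_feats, style_id).shape)  # torch.Size([2, 12, 1000])
```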
Related papers
- ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora [37.53634609063878]
We propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap)
A conditional variational auto-encoder is then used to automatically capture diverse stylistic patterns in latent space.
Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance in terms of consistency with the image, style accuracy, and diversity.
arXiv Detail & Related papers (2023-08-02T13:33:20Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition [70.9310322461598]
We advocate that image-text contrastive learning makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting.
Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning.
Our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-11-23T07:00:11Z)
- Controllable Image Captioning [0.0]
We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
arXiv Detail & Related papers (2022-04-28T07:47:49Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods (a rough sketch of the underlying contrastive scoring idea appears after this list).
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Caption Enriched Samples for Improving Hateful Memes Detection [78.5136090997431]
The hateful meme challenge demonstrates the difficulty of determining whether a meme is hateful or not.
Neither unimodal language models nor multimodal vision-language models reach human-level performance.
arXiv Detail & Related papers (2021-09-22T10:57:51Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experimental results show that our proposed method aligns well with the scores generated by other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
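For the zero-shot entry above (Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic), the sketch below illustrates only the underlying ingredient, i.e. scoring captions against an image with a contrastive image-text model, by re-ranking a handful of candidate sentences with the public CLIP package. The actual paper steers a language model during decoding rather than re-ranking fixed candidates, and the image path and candidate captions here are placeholders.

```python
# Minimal sketch: rank candidate captions by CLIP image-text similarity.
# This shows only the contrastive scoring step, not the paper's full
# decoding procedure; "photo.jpg" and the candidates are placeholders.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
candidates = [
    "a dog running on the beach",
    "a plate of pasta on a table",
    "a crowd watching a concert at night",
]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    # Cosine similarity between the image and each candidate caption.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)

print(candidates[sims.argmax().item()])  # best-matching candidate caption
```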