Improving Generalization of Image Captioning with Unsupervised Prompt
Learning
- URL: http://arxiv.org/abs/2308.02862v1
- Date: Sat, 5 Aug 2023 12:27:01 GMT
- Title: Improving Generalization of Image Captioning with Unsupervised Prompt
Learning
- Authors: Hongchen Wei, Zhenzhong Chen
- Abstract summary: Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
- Score: 63.26197177542422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained visual-language models have demonstrated impressive zero-shot
abilities in image captioning when accompanied by hand-crafted prompts. Such
hand-crafted prompts leverage human prior knowledge to guide the model.
However, because of the diversity between domains, hand-crafted prompts
provide only invariant prior knowledge and may result in mode collapse for
some domains. Some studies have attempted to incorporate expert knowledge and
instruction datasets, but these approaches are costly and prone to
hallucinations. In this paper, we propose an unsupervised prompt learning
method to improve Generalization of Image Captioning (GeneIC), which learns a
domain-specific prompt vector for the target domain without requiring annotated
data. GeneIC aligns visual and language modalities with a pre-trained
Contrastive Language-Image Pre-Training (CLIP) model, thus optimizing the
domain-specific prompt vector from two aspects: attribute and semantic
consistency. Specifically, GeneIC first generates attribute-transferred images
with differing attributes while retaining semantic similarity with the original
images. Then, GeneIC uses CLIP to measure the similarity between the images and
the generated sentences. By exploring the variable and invariant features in
the original images and attribute-transferred images, attribute consistency
constrains the attribute change direction of both images and sentences to learn
domain-specific knowledge. The semantic consistency directly measures the
similarity between the generated sentences and images to ensure the accuracy
and comprehensiveness of the generated sentences. Consequently, GeneIC only
optimizes the prompt vectors, which effectively retains the knowledge in the
large model and introduces domain-specific knowledge.
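The abstract describes the two objectives but not their concrete form. Below is a minimal, hypothetical PyTorch sketch of how a frozen CLIP-style encoder pair could be used to optimize only a domain-specific prompt vector with an attribute-consistency (change-direction matching) loss and a semantic-consistency (image-text similarity) loss. The encoder stubs, dimensions, and the way the prompt conditions the text side are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: optimize only a domain-specific prompt vector with a
# frozen CLIP-style encoder pair, using attribute- and semantic-consistency
# losses as outlined in the abstract. Encoders and dims are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512

# Frozen stand-ins for CLIP's image and text encoders.
image_encoder = nn.Linear(1024, EMBED_DIM).requires_grad_(False)
text_encoder = nn.Linear(EMBED_DIM, EMBED_DIM).requires_grad_(False)

# Only the domain-specific prompt vector is optimized.
prompt = nn.Parameter(torch.randn(EMBED_DIM) * 0.02)
optimizer = torch.optim.Adam([prompt], lr=1e-3)

def encode_caption(prompt_vec, caption_feat):
    # The prompt conditions the frozen text encoder; here it is simply
    # added to a precomputed caption feature as a placeholder.
    return text_encoder(caption_feat + prompt_vec)

def consistency_losses(img, img_attr, cap, cap_attr):
    """img / img_attr: original and attribute-transferred image features.
    cap / cap_attr: caption features generated for each image."""
    z_img = F.normalize(image_encoder(img), dim=-1)
    z_img_attr = F.normalize(image_encoder(img_attr), dim=-1)
    z_cap = F.normalize(encode_caption(prompt, cap), dim=-1)
    z_cap_attr = F.normalize(encode_caption(prompt, cap_attr), dim=-1)

    # Attribute consistency: the attribute-change direction in image space
    # should match the change direction in sentence space.
    d_img = F.normalize(z_img_attr - z_img, dim=-1)
    d_cap = F.normalize(z_cap_attr - z_cap, dim=-1)
    attr_loss = (1 - (d_img * d_cap).sum(-1)).mean()

    # Semantic consistency: each generated caption stays close to its image.
    sem_loss = (1 - (z_img * z_cap).sum(-1)).mean() \
             + (1 - (z_img_attr * z_cap_attr).sum(-1)).mean()
    return attr_loss + sem_loss

# One illustrative optimization step on random tensors.
img, img_attr = torch.randn(8, 1024), torch.randn(8, 1024)
cap, cap_attr = torch.randn(8, EMBED_DIM), torch.randn(8, EMBED_DIM)
loss = consistency_losses(img, img_attr, cap, cap_attr)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoders are frozen, gradients flow only into the prompt vector, which mirrors the paper's claim that optimizing the prompt alone preserves the large model's knowledge while injecting domain-specific knowledge.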
Related papers
- Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z) - WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove the domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z) - Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z) - Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation [45.02052030837188]
We study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework.
We design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language.
PEST outperforms the state-of-the-art consistently across 10 image recognition tasks.
arXiv Detail & Related papers (2023-06-29T03:39:35Z) - Domain-invariant Prototypes for Semantic Segmentation [30.932130453313537]
We present an easy-to-train framework that learns domain-invariant prototypes for domain adaptive semantic segmentation.
Our method involves only one-stage training and does not need to be trained on large-scale un-annotated target images.
arXiv Detail & Related papers (2022-08-12T02:21:05Z) - Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z) - Marginal Contrastive Correspondence for Guided Image Generation [58.0605433671196]
Exemplar-based image translation establishes dense correspondences between a conditional input and an exemplar from two different domains.
Existing work builds the cross-domain correspondences implicitly by minimizing feature-wise distances across the two domains.
We design a Marginal Contrastive Learning Network (MCL-Net) that explores contrastive learning to learn domain-invariant features for realistic exemplar-based image translation.
arXiv Detail & Related papers (2022-04-01T13:55:44Z) - Understanding Guided Image Captioning Performance across Domains [22.283016988026926]
We present a method to control the concepts that an image caption should focus on, using an additional input called the guiding text.
Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets.
arXiv Detail & Related papers (2020-12-04T00:05:02Z) - Unsupervised Domain Attention Adaptation Network for Caricature
Attribute Recognition [23.95731281719786]
Caricature attributes provide distinctive facial features to help research in Psychology and Neuroscience.
Unlike facial photo attribute datasets, which contain large quantities of annotated images, annotations of caricature attributes are rare.
We propose a caricature attribute dataset, namely WebCariA, to facilitate research in attribute learning of caricatures.
arXiv Detail & Related papers (2020-07-18T06:38:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.