Distinctive Image Captioning via CLIP Guided Group Optimization
- URL: http://arxiv.org/abs/2208.04254v3
- Date: Thu, 11 Aug 2022 16:52:54 GMT
- Title: Distinctive Image Captioning via CLIP Guided Group Optimization
- Authors: Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu
- Abstract summary: In this paper, we focus on generating distinctive captions that can distinguish the target image from other similar images.
We introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify distinctiveness.
We propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap.
- Score: 13.102953452346297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning models are usually trained on human-annotated
ground-truth captions, which tend to yield accurate but generic captions. In
this paper, we focus on generating distinctive captions that can distinguish
the target image from other similar images. To evaluate the distinctiveness of
captions, we introduce a series of metrics that use the large-scale
vision-language pre-trained model CLIP to quantify distinctiveness. To further
improve the distinctiveness of captioning models, we propose a simple and
effective training strategy that trains the model by comparing the target
image with a group of similar images and optimizing the group embedding gap.
Extensive experiments on various baseline models demonstrate the wide
applicability of our strategy and the consistency of the metric results with
human evaluation. Comparing our best model with existing state-of-the-art
models, we claim that it achieves a new state of the art on the
distinctiveness objective.
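A minimal sketch of how the two ideas described in the abstract could be realized, based only on that description and not on the authors' released code: a CLIP-space distinctiveness score that compares a caption's similarity to the target image against its mean similarity to a group of similar images, and a hinge-style "group embedding gap" objective. The CLIP backbone (ViT-B/32), the margin value, and all function names are assumptions introduced for illustration.

```python
# Hedged sketch, not the paper's implementation: a CLIP-based distinctiveness
# score and a group-embedding-gap style loss, assuming the openai/CLIP package.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption


@torch.no_grad()
def clip_distinctiveness(caption, target_image, similar_images):
    """Caption-to-target CLIP similarity minus the mean similarity to a group
    of visually similar images (higher = more distinctive)."""
    images = torch.stack(
        [preprocess(im) for im in [target_image, *similar_images]]
    ).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)

    img_emb = model.encode_image(images).float()
    txt_emb = model.encode_text(text).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    sims = (img_emb @ txt_emb.T).squeeze(-1)      # cosine similarities
    return (sims[0] - sims[1:].mean()).item()     # target vs. similar-image group


def group_embedding_gap_loss(caption_emb, target_emb, group_embs, margin=0.2):
    """Hinge on the gap between the caption's similarity to the target image
    and its mean similarity to the similar-image group (hypothetical form)."""
    pos = torch.cosine_similarity(caption_emb, target_emb, dim=-1)
    neg = torch.cosine_similarity(caption_emb.unsqueeze(0), group_embs, dim=-1).mean()
    return torch.clamp(margin - (pos - neg), min=0.0)
```

Since generated captions are discrete, such a gap would in practice more likely serve as a reward signal (for example in a self-critical sequence training setup) than be back-propagated directly through decoding; the abstract does not specify this detail, so the interface above is an assumption.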
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose a straightforward solution that leverages existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z) - Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning over the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method on ten different benchmarks, consistently outperforming baselines and establishing a new state of the art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities become aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - ECO: Ensembling Context Optimization for Vision-Language Models [22.32996522125523]
We show that learning diverse and possibly shorter contexts considerably and consistently improves the results.
We report better few-shot capabilities with no additional cost at inference time.
arXiv Detail & Related papers (2023-07-26T09:31:06Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Group-based Distinctive Image Captioning with Memory Attention [45.763534774116856]
The Group-based Distinctive Captioning Model (GdisCap) improves the distinctiveness of image captions.
A new evaluation metric, distinctive word rate (DisWordRate), is proposed to measure the distinctiveness of captions.
arXiv Detail & Related papers (2021-08-20T12:46:36Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)