Improving Reference-based Distinctive Image Captioning with Contrastive
Rewards
- URL: http://arxiv.org/abs/2306.14259v1
- Date: Sun, 25 Jun 2023 14:37:13 GMT
- Title: Improving Reference-based Distinctive Image Captioning with Contrastive
Rewards
- Authors: Yangjun Mao, Jun Xiao, Dong Zhang, Meng Cao, Jian Shao, Yueting
Zhuang, Long Chen
- Abstract summary: A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images.
We propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
- Score: 52.406331702017596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distinctive Image Captioning (DIC) -- generating distinctive captions that
describe the unique details of a target image -- has received considerable
attention over the last few years. A recent DIC method proposes to generate
distinctive captions by comparing the target image with a set of
semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims
to force the generated captions to distinguish between the target image and the
reference images. To ensure Ref-DIC models really perceive the unique objects
(or attributes) in target images, we propose two new Ref-DIC benchmarks and
develop a Transformer-based Ref-DIC baseline, TransDIC. The model not only extracts
visual features from the target image, but also encodes the differences between
objects in the target and reference images. Going one step further, we propose
a stronger TransDIC++, which consists of an extra contrastive learning module
to make full use of the reference images. This new module is model-agnostic
and can be easily incorporated into various Ref-DIC architectures. Finally,
for more trustworthy benchmarking, we propose a new evaluation metric named
DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of
the generated captions. Experimental results demonstrate that our TransDIC++
can generate distinctive captions. Moreover, it outperforms several
state-of-the-art models on the two new benchmarks across different metrics.
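To make the notion of a contrastive reward concrete, the sketch below rewards a caption for matching the target image's ground-truth captions while penalizing it for also matching the captions of the semantically similar reference images. This is only an illustration under stated assumptions: the abstract does not give TransDIC++'s exact formulation, so a bag-of-words cosine similarity stands in for a CIDEr-style scorer, and the function names and the weight `lam` are hypothetical.

```python
# Illustrative sketch only; not the authors' implementation.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a simple stand-in for a CIDEr-style scorer)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_reward(caption, target_captions, reference_image_captions, lam=0.5):
    """Reward similarity to the target image's captions and penalize similarity
    to captions of the reference (distractor) images; lam is a hypothetical weight."""
    accuracy = max(bow_cosine(caption, c) for c in target_captions)
    confusion = max(bow_cosine(caption, c) for c in reference_image_captions)
    return accuracy - lam * confusion

# A caption that also fits the reference images earns a lower reward.
target_caps = ["a brown dog chasing a red frisbee on the grass"]
reference_caps = ["a dog on the grass", "a dog standing on grass"]
print(contrastive_reward("a dog on the grass", target_caps, reference_caps))                              # ~0.31
print(contrastive_reward("a brown dog chasing a red frisbee on the grass", target_caps, reference_caps))  # ~0.60
```

The same contrast between "fits the target" and "also fits the distractors" is the intuition behind the DisCIDEr metric mentioned above, which scores both accuracy and distinctiveness.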
Related papers
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596] (2024-05-08)
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable to current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning of CIR models using labeled (reference image, text, target image) triplets.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot
classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
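As a rough illustration of the receiver-style evaluation described above, the sketch below scores generated descriptions by whether a receiver picks the intended object out of a set of candidates in a shared embedding space. The encoders are hypothetical placeholders (random vectors here), not DisCLIP's actual components.

```python
# Illustrative sketch only; random vectors stand in for real visual/text embeddings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def receiver_accuracy(samples):
    """samples: list of (description_embedding, candidate_object_embeddings, target_index).
    The receiver picks the candidate most similar to the description."""
    hits = 0
    for text_feat, cand_feats, target_idx in samples:
        pred = int(np.argmax([cosine(text_feat, c) for c in cand_feats]))
        hits += int(pred == target_idx)
    return hits / len(samples)

# Toy usage: 10 scenes, 4 candidate objects each.
rng = np.random.default_rng(0)
samples = [(rng.standard_normal(256),
            [rng.standard_normal(256) for _ in range(4)],
            int(rng.integers(0, 4))) for _ in range(10)]
print(receiver_accuracy(samples))
```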
arXiv Detail & Related papers (2023-05-30T15:13:17Z) - Positive-Augmented Contrastive Learning for Image and Video Captioning
Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
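For intuition about how a contrastive visual-semantic space can score a caption directly against the image, here is a generic CLIPScore-style sketch. It is not PAC-S itself (the positive-augmented training recipe and its weights are not described here), and the embeddings below are random placeholders rather than outputs of a trained model.

```python
# Illustrative sketch only; not the PAC-S implementation.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def clipscore_style(image_embedding, caption_embedding, w=2.5):
    """Clamp the image-caption cosine at zero and rescale by w (CLIPScore convention)."""
    return w * max(cosine(image_embedding, caption_embedding), 0.0)

# Toy usage with placeholder embeddings in place of a contrastive encoder's outputs.
rng = np.random.default_rng(0)
img_emb, cap_emb = rng.standard_normal(512), rng.standard_normal(512)
print(clipscore_style(img_emb, cap_emb))
```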
arXiv Detail & Related papers (2023-03-21T18:03:14Z) - Rethinking the Reference-based Distinctive Image Captioning [17.724543105544935]
A recent work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images.
We develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
arXiv Detail & Related papers (2022-07-22T14:49:54Z) - IR-GAN: Image Manipulation with Linguistic Instruction by Increment
Reasoning [110.7118381246156]
The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image plays the role of a referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behavior between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.