Improving Reference-based Distinctive Image Captioning with Contrastive
Rewards
- URL: http://arxiv.org/abs/2306.14259v1
- Date: Sun, 25 Jun 2023 14:37:13 GMT
- Title: Improving Reference-based Distinctive Image Captioning with Contrastive
Rewards
- Authors: Yangjun Mao, Jun Xiao, Dong Zhang, Meng Cao, Jian Shao, Yueting
Zhuang, Long Chen
- Abstract summary: A recent DIC method proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images.
We propose two new Ref-DIC benchmarks and develop a Transformer-based Ref-DIC baseline TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
- Score: 52.406331702017596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distinctive Image Captioning (DIC) -- generating distinctive captions that
describe the unique details of a target image -- has received considerable
attention over the last few years. A recent DIC method proposes to generate
distinctive captions by comparing the target image with a set of
semantically similar reference images, i.e., reference-based DIC (Ref-DIC). It aims
to force the generated captions to distinguish between the target image and the
reference images. To ensure Ref-DIC models really perceive the unique objects
(or attributes) in target images, we propose two new Ref-DIC benchmarks and
develop a Transformer-based Ref-DIC baseline, TransDIC. The model not only extracts
visual features from the target image, but also encodes the differences between
objects in the target and reference images. Going one step further, we propose
a stronger TransDIC++, which consists of an extra contrastive learning module
to make full use of the reference images. This new module is model-agnostic
and can be easily incorporated into various Ref-DIC architectures. Finally,
for more trustworthy benchmarking, we propose a new evaluation metric named
DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of
the generated captions. Experimental results demonstrate that our TransDIC++
can generate distinctive captions. Moreover, it outperforms several
state-of-the-art models on the two new benchmarks across different metrics.
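To make the notion of a contrastive reward concrete, the sketch below rewards a caption for matching the target image's ground-truth captions while penalizing it for also matching the captions of the semantically similar reference images. This is only an illustration under stated assumptions: the abstract does not give TransDIC++'s exact formulation, so a bag-of-words cosine similarity stands in for a CIDEr-style scorer, and the function names and the weight `lam` are hypothetical.

```python
# Illustrative sketch only; not the authors' implementation.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a simple stand-in for a CIDEr-style scorer)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def contrastive_reward(caption, target_captions, reference_image_captions, lam=0.5):
    """Reward similarity to the target image's captions and penalize similarity
    to captions of the reference (distractor) images; lam is a hypothetical weight."""
    accuracy = max(bow_cosine(caption, c) for c in target_captions)
    confusion = max(bow_cosine(caption, c) for c in reference_image_captions)
    return accuracy - lam * confusion

# A caption that also fits the reference images earns a lower reward.
target_caps = ["a brown dog chasing a red frisbee on the grass"]
reference_caps = ["a dog on the grass", "a dog standing on grass"]
print(contrastive_reward("a dog on the grass", target_caps, reference_caps))                              # ~0.31
print(contrastive_reward("a brown dog chasing a red frisbee on the grass", target_caps, reference_caps))  # ~0.60
```

The same contrast between "fits the target" and "also fits the distractors" is the intuition behind the DisCIDEr metric mentioned above, which scores both accuracy and distinctiveness.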
Related papers
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596] (2024-05-08)
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable to current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning of CIR models using labeled (reference image, text, target image) triplets.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Diversified in-domain synthesis with efficient fine-tuning for few-shot
classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z) - DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
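As a rough illustration of the receiver-style evaluation described above, the sketch below scores generated descriptions by whether a receiver picks the intended object out of a set of candidates in a shared embedding space. The encoders are hypothetical placeholders (random vectors here), not DisCLIP's actual components.

```python
# Illustrative sketch only; random vectors stand in for real visual/text embeddings.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def receiver_accuracy(samples):
    """samples: list of (description_embedding, candidate_object_embeddings, target_index).
    The receiver picks the candidate most similar to the description."""
    hits = 0
    for text_feat, cand_feats, target_idx in samples:
        pred = int(np.argmax([cosine(text_feat, c) for c in cand_feats]))
        hits += int(pred == target_idx)
    return hits / len(samples)

# Toy usage: 10 scenes, 4 candidate objects each.
rng = np.random.default_rng(0)
samples = [(rng.standard_normal(256),
            [rng.standard_normal(256) for _ in range(4)],
            int(rng.integers(0, 4))) for _ in range(10)]
print(receiver_accuracy(samples))
```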
arXiv Detail & Related papers (2023-05-30T15:13:17Z) - Positive-Augmented Contrastive Learning for Image and Video Captioning
Evaluation [47.40949434032489]
We propose a new contrastive-based evaluation metric for image captioning, namely the Positive-Augmented Contrastive learning Score (PAC-S).
PAC-S unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data.
Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos.
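For intuition about how a contrastive visual-semantic space can score a caption directly against the image, here is a generic CLIPScore-style sketch. It is not PAC-S itself (the positive-augmented training recipe and its weights are not described here), and the embeddings below are random placeholders rather than outputs of a trained model.

```python
# Illustrative sketch only; not the PAC-S implementation.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def clipscore_style(image_embedding, caption_embedding, w=2.5):
    """Clamp the image-caption cosine at zero and rescale by w (CLIPScore convention)."""
    return w * max(cosine(image_embedding, caption_embedding), 0.0)

# Toy usage with placeholder embeddings in place of a contrastive encoder's outputs.
rng = np.random.default_rng(0)
img_emb, cap_emb = rng.standard_normal(512), rng.standard_normal(512)
print(clipscore_style(img_emb, cap_emb))
```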
arXiv Detail & Related papers (2023-03-21T18:03:14Z) - Rethinking the Reference-based Distinctive Image Captioning [17.724543105544935]
A recent work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images.
We develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC.
For more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC.
arXiv Detail & Related papers (2022-07-22T14:49:54Z) - IR-GAN: Image Manipulation with Linguistic Instruction by Increment
Reasoning [110.7118381246156]
The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image plays the role of a referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z) - Two-stage Visual Cues Enhancement Network for Referring Image
Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behavior between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.