Pragmatic Inference with a CLIP Listener for Contrastive Captioning
- URL: http://arxiv.org/abs/2306.08818v1
- Date: Thu, 15 Jun 2023 02:22:28 GMT
- Title: Pragmatic Inference with a CLIP Listener for Contrastive Captioning
- Authors: Jiefu Ou, Benno Krojer and Daniel Fried
- Abstract summary: We propose a method for generating discriminative captions that distinguish target images from very similar alternative distractor images.
Our approach is built on a pragmatic inference procedure that formulates captioning as a reference game between a speaker and a listener.
- Score: 10.669625017690658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a simple yet effective and robust method for contrastive
captioning: generating discriminative captions that distinguish target images
from very similar alternative distractor images. Our approach is built on a
pragmatic inference procedure that formulates captioning as a reference game
between a speaker, which produces possible captions describing the target, and
a listener, which selects the target given the caption. Unlike previous methods
that derive both speaker and listener distributions from a single captioning
model, we leverage an off-the-shelf CLIP model to parameterize the listener.
Compared with captioner-only pragmatic models, our method benefits from rich
vision language alignment representations from CLIP when reasoning over
distractors. Like previous methods for discriminative captioning, our method
uses a hyperparameter to control the tradeoff between the informativity (how
likely captions are to allow a human listener to discriminate the target image)
and the fluency of the captions. However, we find that our method is
substantially more robust to the value of this hyperparameter than past
methods, which allows us to automatically optimize the captions for
informativity, outperforming past methods for discriminative captioning by 11%
to 15% accuracy in human evaluations.
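To make the reference-game formulation concrete, here is a minimal sketch of how such a reranking step could look. It assumes the captioner's per-caption log-probabilities and a matrix of CLIP image-text logits are already available; the rerank_captions helper, the (1 - lam) / lam weighting, and all numbers are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of pragmatic reranking with a CLIP listener:
# a captioner proposes candidate captions for the target image, CLIP scores each
# caption against the target and its distractors, the listener distribution is a
# softmax over images, and a weight `lam` trades informativity against fluency.
import numpy as np

def rerank_captions(speaker_logprobs, clip_logits, target_idx, lam=0.7):
    """Return the index of the caption with the best fluency/informativity mix.

    speaker_logprobs: (n_captions,)        log p_speaker(caption | target image)
    clip_logits:      (n_captions, n_imgs) CLIP image-text logits (cosine sim * logit scale)
    target_idx:       column of the target image among target + distractors
    lam:              weight on listener informativity (0 = pure fluency)
    """
    logits = np.asarray(clip_logits, dtype=float)
    # Listener: softmax over the image set for each candidate caption.
    z = logits - logits.max(axis=1, keepdims=True)              # numerical stability
    listener_probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    informativity = np.log(listener_probs[:, target_idx] + 1e-12)
    scores = (1.0 - lam) * np.asarray(speaker_logprobs) + lam * informativity
    return int(np.argmax(scores))

# Stand-in example: 3 candidate captions, the target image plus 2 distractors.
speaker_lp = np.array([-4.2, -5.1, -6.0])        # caption 0 is the most fluent
clip_logits = 100.0 * np.array([                 # CLIP-style logits (cosine sim * ~100)
    [0.30, 0.30, 0.30],                          # caption 0: does not discriminate
    [0.35, 0.22, 0.21],                          # caption 1: clearly picks the target
    [0.28, 0.27, 0.33],                          # caption 2: points at a distractor
])
print(rerank_captions(speaker_lp, clip_logits, target_idx=0))   # prints 1
```

In the stand-in example, caption 1 wins: it is slightly less fluent than caption 0, but it is the only caption the CLIP listener confidently resolves to the target image, which is exactly the tradeoff the informativity hyperparameter controls.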
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio: (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover plain, visually descriptive language.
We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
- Controllable Image Captioning [0.0]
We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
arXiv Detail & Related papers (2022-04-28T07:47:49Z)
- Caption Feature Space Regularization for Audio Captioning [24.40864471466915]
General audio captioning models handle the one-to-many nature of the task by randomly selecting one of the correlated captions as the ground truth for each audio clip.
We propose a two-stage framework for audio captioning: (i) in the first stage, contrastive learning is used to construct a proxy feature space that reduces the distances between captions correlated with the same audio, and (ii) in the second stage, the proxy feature space serves as additional supervision that pushes the model in a direction benefiting all the correlated captions. A toy sketch of the first, contrastive stage is given after this list.
arXiv Detail & Related papers (2022-04-18T17:07:31Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the advantage of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
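As noted in the Caption Feature Space Regularization entry above, the following is a toy sketch of a contrastive stage that pulls together embeddings of captions describing the same audio clip. The InfoNCE-style loss, the temperature value, and the function name are assumptions made for illustration; they are not taken from that paper's code.

```python
# Toy sketch of a contrastive "proxy feature space" stage: caption embeddings
# for the same audio clip are treated as positives and pulled together, while
# captions of other clips act as negatives. Shapes and temperature are assumed.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(caption_emb, audio_ids, temperature=0.07):
    """caption_emb: (N, d) caption embeddings; audio_ids: (N,) clip id per caption."""
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.t() / temperature                                 # pairwise similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (audio_ids.unsqueeze(0) == audio_ids.unsqueeze(1)) & ~eye
    sim = sim.masked_fill(eye, float('-inf'))                     # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)           # log-softmax over the batch
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    # average log-probability of the positives for each anchor caption
    loss = -pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# Usage with random stand-in data: 6 captions covering 3 audio clips (2 each).
emb = torch.randn(6, 256)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(caption_contrastive_loss(emb, ids))
```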