Simple Token-Level Confidence Improves Caption Correctness
- URL: http://arxiv.org/abs/2305.07021v1
- Date: Thu, 11 May 2023 17:58:17 GMT
- Title: Simple Token-Level Confidence Improves Caption Correctness
- Authors: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell,
Anna Rohrbach, Marcus Rohrbach
- Abstract summary: Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency.
- Score: 117.33497608933169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to judge whether a caption correctly describes an image is a
critical part of vision-language understanding. However, state-of-the-art
models often misinterpret the correctness of fine-grained details, leading to
errors in outputs such as hallucinating objects in generated captions or poor
compositional reasoning. In this work, we explore Token-Level Confidence, or
TLC, as a simple yet surprisingly effective method to assess caption
correctness. Specifically, we fine-tune a vision-language model on image
captioning, input an image and proposed caption to the model, and aggregate
either algebraic or learned token confidences over words or sequences to
estimate image-caption consistency. Compared to sequence-level scores from
pretrained models, TLC with algebraic confidence measures achieves a relative
improvement in accuracy by 10% on verb understanding in SVO-Probes and
outperforms prior state-of-the-art in image and group scores for compositional
reasoning in Winoground by a relative 37% and 9%, respectively. When training
data are available, a learned confidence estimator provides further improved
performance, reducing object hallucination rates in MS COCO Captions by a
relative 30% over the original model and setting a new state-of-the-art.
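To make the algebraic variant concrete, the sketch below scores a proposed caption by the per-token probabilities an image-conditioned captioner assigns to it, then aggregates them into a single image-caption consistency score. This is a minimal illustration, not the authors' implementation: the BLIP checkpoint, the helper names, and the min/mean aggregation choices are assumptions standing in for the fine-tuned captioner and the word- or sequence-level aggregation described in the abstract.

```python
# Minimal sketch of algebraic token-level confidence (TLC-style), assuming a
# HuggingFace BLIP captioning checkpoint; the paper fine-tunes its own
# vision-language captioner, so the model and aggregation here are illustrative.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

def token_confidences(image: Image.Image, caption: str) -> torch.Tensor:
    """Per-token probabilities of `caption` given `image` under the captioner."""
    inputs = processor(images=image, text=caption, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    # Standard causal scoring: the logit at position t predicts token t+1.
    logits = outputs.logits[:, :-1, :]
    targets = inputs["input_ids"][:, 1:]
    probs = logits.softmax(dim=-1)
    return probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # shape [1, T-1]

def caption_consistency(image: Image.Image, caption: str, reduce: str = "min") -> float:
    """Aggregate token confidences into one image-caption consistency score."""
    conf = token_confidences(image, caption)
    return conf.min().item() if reduce == "min" else conf.mean().item()

# Example (hypothetical file): a hallucinated object should drag the minimum
# token confidence down relative to a faithful caption.
# img = Image.open("example.jpg")
# print(caption_consistency(img, "a dog catching a frisbee"))
```

In this sketch, taking the minimum over token confidences is meant to surface a single low-confidence (e.g., hallucinated) word, while the mean yields a sequence-level score; the learned estimator mentioned above would replace this algebraic aggregation with a trained predictor.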
Related papers
- Fluent and Accurate Image Captioning with a Self-Trained Reward Model [47.213906345208315]
We propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives.
Our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness.
arXiv Detail & Related papers (2024-08-29T18:00:03Z)
- Image-Caption Encoding for Improving Zero-Shot Generalization [12.906307770270026]
We show that when an OOD data point is misclassified, the correct class can typically be found in the Top-K predicted classes.
To steer the model prediction toward the correct class within the top predicted classes, we propose the Image-Caption Encoding (ICE) method.
Our method can be easily combined with other SOTA methods to enhance Top-1 OOD accuracies by 0.5% on average and up to 3% on challenging datasets.
arXiv Detail & Related papers (2024-02-05T01:14:07Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- Self-Supervised Image Captioning with CLIP [0.0]
We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset.
arXiv Detail & Related papers (2023-06-26T23:29:16Z)
- Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments. A minimal prompt-scoring sketch of this zero-shot assessment appears after this list.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- Visually Grounded Compound PCFGs [65.04669567781634]
Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual captions.
arXiv Detail & Related papers (2020-09-25T19:07:00Z)
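As referenced in the "Exploring CLIP for Assessing the Look and Feel of Images" entry above, a minimal prompt-scoring sketch of zero-shot perceptual assessment is given below: it compares an image against an antonym prompt pair and reads the softmax weight on the positive prompt as the score. The checkpoint and prompt pair are assumptions for illustration rather than that paper's exact setup.

```python
# Minimal sketch of zero-shot 'look and feel' assessment with CLIP via an
# antonym prompt pair; checkpoint and prompts are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_perceptual_score(image: Image.Image,
                          positive: str = "a good photo.",
                          negative: str = "a bad photo.") -> float:
    """Softmax weight on the positive prompt, in [0, 1]; higher means better 'look'."""
    inputs = processor(text=[positive, negative], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: [1, 2] image-text similarities scaled by CLIP's temperature.
    probs = outputs.logits_per_image.softmax(dim=-1)
    return probs[0, 0].item()

# Example (hypothetical file): swapping the prompt pair, e.g. "a happy photo." /
# "a sad photo.", probes abstract 'feel' rather than technical quality.
# print(clip_perceptual_score(Image.open("example.jpg")))
```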
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.