B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
- URL: http://arxiv.org/abs/2004.02435v2
- Date: Sun, 28 Jun 2020 22:37:13 GMT
- Title: B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
- Authors: Shashank Bujimalla, Mahesh Subedar, Omesh Tickoo
- Abstract summary: We propose a Bayesian variant of a policy-gradient-based reinforcement learning technique for image captioning models.
We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference.
We show that B-SCST improves CIDEr-D scores on Flickr30k, MS COCO and VizWiz image captioning datasets, compared to the SCST approach.
- Score: 8.7660229706359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bayesian deep neural networks (DNNs) can provide a mathematically grounded
framework to quantify uncertainty in predictions from image captioning models.
We propose a Bayesian variant of a policy-gradient-based reinforcement learning
training technique for image captioning models to directly optimize
non-differentiable image captioning quality metrics such as CIDEr-D. We extend
the well-known Self-Critical Sequence Training (SCST) approach for image
captioning models by incorporating Bayesian inference, and refer to it as
B-SCST. The "baseline" for the policy-gradients in B-SCST is generated by
averaging predictive quality metrics (CIDEr-D) of the captions drawn from the
distribution obtained using a Bayesian DNN model. We infer this predictive
distribution using Monte Carlo (MC) dropout approximate variational inference.
We show that B-SCST improves CIDEr-D scores on Flickr30k, MS COCO and VizWiz
image captioning datasets, compared to the SCST approach. We also provide a
study of uncertainty quantification for the predicted captions, and demonstrate
that it correlates well with the CIDEr-D scores. To our knowledge, this is the
first such analysis, and it can improve the interpretability of image
captioning model outputs, which is critical for practical applications.
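For concreteness, here is a minimal sketch of the training signal described above: K captions are drawn from K stochastic (MC-dropout) forward passes, each is scored with CIDEr-D, and the mean score over the K samples serves as the policy-gradient baseline. This is our own PyTorch rendering under those assumptions, not the authors' code; the function name and toy values are illustrative.

```python
import torch

def b_scst_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: (K,) summed token log-probabilities of K sampled captions.
    rewards: (K,) CIDEr-D score of each sampled caption."""
    baseline = rewards.mean()               # mean reward over the MC-dropout samples
    advantage = rewards - baseline          # positive if a caption beats the MC average
    return -(advantage * log_probs).mean()  # REINFORCE with the self-critical baseline

# Toy usage; real log-probs come from captions decoded with dropout left active,
# and real rewards from a CIDEr-D scorer against the reference captions.
log_probs = torch.randn(5, requires_grad=True)
rewards = torch.tensor([0.9, 1.1, 0.7, 1.3, 1.0])
b_scst_loss(log_probs, rewards).backward()
```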
Related papers
- Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z)
- Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images [14.236580915897585]
RSICC aims at generating human-like language to describe semantic changes between bi-temporal remote sensing image pairs.
Inspired by the remarkable generative power of diffusion models, we propose a probabilistic diffusion model for RSICC.
In the training process, we construct a noise predictor conditioned on cross-modal features to learn the mapping from the real caption distribution to the standard Gaussian distribution under the Markov chain (a generic sketch of this training step appears after this list).
In the testing phase, the well-trained noise predictor helps to estimate the mean value of the distribution and generate change captions step by step.
arXiv Detail & Related papers (2024-05-21T15:44:31Z)
- Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation [0.40792653193642503]
We identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models.
We propose a semantic approach, using a pairwise mean CLIP score as our semantic consistency score (a minimal sketch of this score appears after this list).
arXiv Detail & Related papers (2024-04-12T20:16:03Z)
- Stochastic Segmentation with Conditional Categorical Diffusion Models [3.8168879948759953]
We propose a conditional categorical diffusion model (CCDM) for semantic segmentation based on Denoising Diffusion Probabilistic Models.
Our results show that CCDM achieves state-of-the-art performance on LIDC, and outperforms established baselines on the classical segmentation dataset Cityscapes.
arXiv Detail & Related papers (2023-03-15T19:16:47Z)
- Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs)
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of its powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions (see the sketch after this list).
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)
- Explanation-Guided Training for Cross-Domain Few-Shot Classification [96.12873073444091]
The cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by different datasets.
We introduce a novel training approach for existing FSC models.
We show that explanation-guided training effectively improves the model generalization.
arXiv Detail & Related papers (2020-07-17T07:28:08Z)
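For the Diffusion-RSCC entry above, here is a generic sketch of the conditional noise-predictor training step its summary describes. This is the standard DDPM epsilon-prediction recipe as we read it, not the paper's code; all names, shapes, the noise schedule, and the placeholder predictor are assumptions.

```python
import torch

def ddpm_caption_training_step(x0, cond, noise_predictor, alphas_cumprod):
    """x0: (B, D) clean caption embeddings; cond: cross-modal conditioning features."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))  # random diffusion step per sample
    a = alphas_cumprod[t].unsqueeze(-1)                        # (B, 1) cumulative alpha at step t
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise               # corrupt x0 toward a standard Gaussian
    pred = noise_predictor(x_t, t, cond)                       # predictor sees cross-modal features
    return torch.nn.functional.mse_loss(pred, noise)           # epsilon-prediction objective

# Toy usage with a placeholder predictor that ignores its conditioning:
alphas_cumprod = torch.linspace(0.999, 0.01, steps=1000)
predictor = lambda x_t, t, cond: torch.zeros_like(x_t)
loss = ddpm_caption_training_step(torch.randn(4, 64), torch.randn(4, 64), predictor, alphas_cumprod)
```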
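For the semantic consistency entry, a pairwise mean CLIP score can be read as the average cosine similarity over all unordered pairs of CLIP embeddings of images generated from the same prompt. The sketch below uses random placeholder embeddings; in practice they would come from a CLIP image encoder.

```python
import numpy as np

def pairwise_mean_clip_score(embeddings: np.ndarray) -> float:
    """embeddings: (N, D) CLIP image embeddings for N generations of one prompt."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
    sims = e @ e.T                                 # (N, N) cosine similarities
    iu = np.triu_indices(len(e), k=1)              # indices of the unordered pairs
    return float(sims[iu].mean())                  # mean over the N*(N-1)/2 pairs

print(pairwise_mean_clip_score(np.random.randn(8, 512)))
```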
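And for the kNN-representations entry, a minimal sketch of the retrieval step: index the model's hidden representations of training examples, then fetch the k nearest ones for a test prediction. The representations here are random placeholders, and the choice of layer and distance metric belongs to the paper's design space, not this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

train_reprs = np.random.randn(1000, 768)  # hidden representations of the training set (placeholder)
test_repr = np.random.randn(1, 768)       # representation of one test input (placeholder)

index = NearestNeighbors(n_neighbors=5).fit(train_reprs)
dists, idx = index.kneighbors(test_repr)  # k training examples closest in representation space
print(idx[0], dists[0])                   # candidates "responsible" for the prediction
```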
This list is automatically generated from the titles and abstracts of the papers on this site.