Uncertainty-Aware Image Captioning
- URL: http://arxiv.org/abs/2211.16769v1
- Date: Wed, 30 Nov 2022 06:19:47 GMT
- Title: Uncertainty-Aware Image Captioning
- Authors: Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei,
Xiaolin Wei
- Abstract summary: We propose an uncertainty-aware image captioning framework.
We use an image-conditioned bag-of-word model to measure the word uncertainty.
Our approach outperforms the strong baseline and related methods on both captioning quality and decoding speed.
- Score: 40.984969950016236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is widely believed that the higher the uncertainty of a word in a
caption, the more inter-correlated context information is required to determine
it. However, current image captioning methods usually consider the generation of
all words in a sentence sequentially and equally. In this paper, we propose an
uncertainty-aware image captioning framework, which iteratively inserts
discontinuous candidate words in parallel between existing words, from easy to
difficult, until convergence. We hypothesize that high-uncertainty words
in a sentence need more prior information to make a correct decision and should
be produced at a later stage. The resulting non-autoregressive hierarchy makes
the caption generation explainable and intuitive. Specifically, we utilize an
image-conditioned bag-of-word model to measure the word uncertainty and apply a
dynamic programming algorithm to construct the training pairs. During
inference, we devise an uncertainty-adaptive parallel beam search technique
that yields an empirically logarithmic time complexity. Extensive experiments
on the MS COCO benchmark reveal that our approach outperforms the strong
baseline and related methods in both captioning quality and decoding speed.
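The easy-to-difficult insertion schedule can be illustrated with a toy sketch. The per-word uncertainty scores and fixed batch size below are hypothetical stand-ins for the paper's image-conditioned bag-of-words model and adaptive beam search; only the ordering idea (low-uncertainty words first, high-uncertainty words inserted later between them) is taken from the abstract.

```python
def generation_stages(caption, uncertainty, batch=2):
    """Replay the easy-to-difficult insertion hierarchy for a known caption:
    low-uncertainty words appear in early stages, while high-uncertainty
    words are inserted between existing ones in later rounds.
    `uncertainty` maps each word to a score; in the paper this would come
    from an image-conditioned bag-of-words model (not modeled here)."""
    words = caption.split()
    # Rank positions by the uncertainty of the word that fills them.
    order = sorted(range(len(words)), key=lambda i: uncertainty[words[i]])
    filled = set()
    stages = []
    for start in range(0, len(order), batch):
        filled.update(order[start:start + batch])
        # The partial caption keeps the original word order; gaps shrink
        # as more uncertain words are inserted in later rounds.
        stages.append(" ".join(words[i] for i in sorted(filled)))
    return stages

unc = {"a": 0.1, "dog": 0.5, "runs": 0.9, "on": 0.2, "the": 0.15, "beach": 0.7}
for stage in generation_stages("a dog runs on the beach", unc):
    print(stage)
```

Each printed stage is one parallel insertion round; because every round fills a fixed fraction of the remaining positions, the number of rounds grows roughly logarithmically with caption length, mirroring the empirically logarithmic decoding complexity the abstract claims.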
Related papers
- Image Generation from Contextually-Contradictory Prompts [50.999420029656214]
We propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts.
Our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions.
arXiv Detail & Related papers (2025-06-02T17:48:12Z)
- Latent Beam Diffusion Models for Decoding Image Sequences [11.865234147230616]
Existing methods generate each image independently, leading to disjointed narratives.
We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences.
By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.
arXiv Detail & Related papers (2025-03-26T11:01:10Z)
- Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis [9.11767497956649]
This paper proposes leveraging the language comprehension capabilities of large vision-language models to guide the optimization of the initial noisy latent.
We introduce the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency.
Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models.
arXiv Detail & Related papers (2024-11-25T15:40:47Z)
- Introspective Deep Metric Learning [91.47907685364036]
We propose an introspective deep metric learning framework for uncertainty-aware comparisons of images.
The proposed IDML framework improves the performance of deep metric learning through uncertainty modeling.
arXiv Detail & Related papers (2023-09-11T16:21:13Z)
- Introspective Deep Metric Learning for Image Retrieval [80.29866561553483]
We argue that a good similarity model should consider the semantic discrepancies with caution to better deal with ambiguous images for more robust training.
We propose to represent an image with not only a semantic embedding but also an accompanying uncertainty embedding, which describe the semantic characteristics and the ambiguity of an image, respectively.
The proposed IDML framework improves the performance of deep metric learning through uncertainty modeling and attains state-of-the-art results on the widely used CUB-200-2011, Cars196, and Stanford Online Products datasets.
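The pairing of a semantic embedding with an uncertainty embedding can be sketched minimally. The discounting rule below is a hypothetical stand-in, not the IDML formulation; it only illustrates the blurb's idea that ambiguous images should be treated with caution, i.e. penalized less harshly in similarity comparisons.

```python
import math

def uncertainty_aware_distance(sem_a, unc_a, sem_b, unc_b):
    """Toy uncertainty-aware comparison: the raw semantic distance is
    discounted by the combined ambiguity of the two images, so that
    ambiguous pairs contribute a softer training signal. The specific
    discount rule is illustrative only."""
    # Plain Euclidean distance between the semantic embeddings.
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(sem_a, sem_b)))
    # Summarize each uncertainty embedding by its mean, then discount.
    ambiguity = sum(unc_a) / len(unc_a) + sum(unc_b) / len(unc_b)
    return d / (1.0 + ambiguity)

# A clear pair keeps most of its distance; an ambiguous pair shrinks it.
clear = uncertainty_aware_distance([0, 0], [0.0], [3, 4], [0.0])
fuzzy = uncertainty_aware_distance([0, 0], [2.0], [3, 4], [2.0])
```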
arXiv Detail & Related papers (2022-05-09T17:51:44Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC) to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
- Automatic Vocabulary and Graph Verification for Accurate Loop Closure Detection [21.862978912891677]
Bag-of-Words (BoW) builds a visual vocabulary to associate features and then detect loops.
We propose a natural convergence criterion based on the comparison between the radii of nodes and the drifts of feature descriptors.
We present a novel topological graph verification method for validating candidate loops.
arXiv Detail & Related papers (2021-07-30T13:19:33Z)
- Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z)
- DeepSim: Semantic similarity metrics for learned image registration [6.789370732159177]
We propose a semantic similarity metric for image registration.
Our approach learns dataset-specific features that drive the optimization of a learning-based registration model.
arXiv Detail & Related papers (2020-11-11T12:35:07Z)
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
Our NAIC model achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.