Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
- URL: http://arxiv.org/abs/2209.12817v2
- Date: Thu, 6 Jul 2023 22:58:11 GMT
- Title: Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
- Authors: Ahmed Sabir
- Abstract summary: We propose an approach for improving caption generation systems by choosing the output most closely related to the image.
We employ a visual semantic measure at the word and sentence levels to match the most appropriate caption to the related information in the image.
- Score: 2.1828601975620257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper focuses on enhancing the captions generated by image-caption
generation systems. We propose an approach for improving caption generation
systems by choosing the output most closely related to the image rather than
the most likely output produced by the model. Our model revises the beam-search
output of the language generation model from a visual context perspective. We
employ a visual semantic measure at the word and sentence levels to match the
most appropriate caption to the related information in the image. The proposed
approach can be applied to any captioning system as a post-processing method.
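A minimal sketch of this re-ranking idea is given below; it is not the paper's implementation. It assumes a CLIP-style encoder from the sentence-transformers library as the visual-semantic measure, uses sentence-level similarity only, and blends it with the captioner's beam-search log-probability; the model name, weighting, and score combination are illustrative assumptions.

```python
# Minimal sketch of visual-semantic re-ranking of beam-search captions.
# Assumptions (not from the paper): CLIP ("clip-ViT-B-32" via sentence-transformers)
# as the visual-semantic measure, sentence-level similarity only, and a simple
# weighted blend with the captioner's log-probability.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")

def rerank_captions(image_path, candidates, alpha=0.7):
    """candidates: list of (caption, log_prob) pairs from beam search."""
    image_emb = clip.encode(Image.open(image_path), convert_to_tensor=True)
    caption_embs = clip.encode([c for c, _ in candidates], convert_to_tensor=True)
    sims = util.cos_sim(caption_embs, image_emb).squeeze(-1)  # one score per caption

    best_score, best_caption = float("-inf"), None
    for (caption, log_prob), sim in zip(candidates, sims.tolist()):
        # Blend the visual-semantic score with the language-model score.
        score = alpha * sim + (1.0 - alpha) * log_prob
        if score > best_score:
            best_score, best_caption = score, caption
    return best_caption

# Example: re-rank two beam-search hypotheses for one image.
beam = [("a man riding a surfboard", -1.8), ("a man riding a horse", -2.1)]
print(rerank_captions("photo.jpg", beam))
```

A word-level variant would additionally score individual caption words (for example, object mentions) against the visual context before the sentence-level comparison; the sketch omits that step for brevity.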
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: being informationally sufficient, minimally redundant, and readily comprehensible by humans.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to those of models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task as measured by the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Belief Revision based Caption Re-ranker with Visual Semantic Information [31.20692237930281]
We propose a novel re-ranking approach that leverages visual-semantic measures to identify the ideal caption.
Our experiments demonstrate the utility of our approach, where we observe that our re-ranker can enhance the performance of a typical image-captioning system.
arXiv Detail & Related papers (2022-09-16T20:36:41Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- RefineCap: Concept-Aware Refinement for Image Captioning [34.35093893441625]
We propose a novel model, termed RefineCap, that refines the output vocabulary of the language decoder using decoder-guided visual semantics.
Our model achieves superior performance on the MS-COCO dataset in comparison with previous visual-concept based models.
arXiv Detail & Related papers (2021-09-08T10:12:14Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
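Two of the entries above (Towards Retrieval-Augmented Architectures for Image Captioning and Retrieval-Augmented Transformer for Image Captioning) revolve around a kNN memory queried by visual similarity. The sketch below illustrates only that retrieval step under stated assumptions: a FAISS flat index over precomputed, L2-normalized image embeddings, with the retrieved captions returned as extra decoding context; the index type, embedding source, and memory layout are illustrative rather than taken from those papers.

```python
# Hedged sketch of the kNN retrieval step behind retrieval-augmented captioning:
# find visually similar images in an external memory and hand their captions to
# the decoder as extra context. FAISS and the embedding source are assumptions.
import numpy as np
import faiss

def build_memory(image_embeddings: np.ndarray, captions: list[str]):
    """image_embeddings: (N, d) float32 array; captions: N strings, one per image."""
    faiss.normalize_L2(image_embeddings)             # cosine similarity via inner product
    index = faiss.IndexFlatIP(image_embeddings.shape[1])
    index.add(image_embeddings)
    return index, captions

def retrieve_context(index, captions, query_embedding: np.ndarray, k: int = 5):
    """Return captions of the k memory images most similar to the query image."""
    query = query_embedding.astype(np.float32).reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, k)
    return [captions[i] for i in ids[0]]

# The retrieved captions would then condition generation, for example through a
# kNN-augmented attention layer over their token embeddings.
```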
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.