Guiding Attention using Partial-Order Relationships for Image Captioning
- URL: http://arxiv.org/abs/2204.07476v1
- Date: Fri, 15 Apr 2022 14:22:09 GMT
- Title: Guiding Attention using Partial-Order Relationships for Image Captioning
- Authors: Murad Popattia, Muhammad Rafi, Rizwan Qureshi, Shah Nawaz
- Abstract summary: A guided attention network mechanism exploits the relationship between the visual scene and text descriptions.
A pairwise ranking objective is used to train this embedding space, so that similar images, topics, and captions maintain a partial order in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
- Score: 2.620091916172863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of attention models for automated image captioning has enabled many
systems to produce accurate and meaningful descriptions for images. Over the
years, many novel approaches have been proposed to enhance the attention
process using different feature representations. In this paper, we extend this
approach by creating a guided attention network mechanism that exploits the
relationship between the visual scene and text descriptions using spatial
features from the image, high-level information from the topics, and temporal
context from caption generation, which are embedded together in an ordered
embedding space. A pairwise ranking objective is used for training this
embedding space, which allows similar images, topics, and captions in the shared
semantic space to maintain a partial order in the visual-semantic hierarchy and
hence, helps the model to produce more visually accurate captions.
Experimental results on the MSCOCO dataset show that our approach is
competitive with many state-of-the-art models on various evaluation metrics.
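
As a rough illustration of the pairwise ranking objective over an ordered embedding space, here is a minimal PyTorch sketch in the style of order embeddings; the function names, margin value, direction of the order convention, and use of rolled in-batch negatives are illustrative assumptions, not details taken from the paper.

```python
import torch

def order_penalty(specific, general):
    """Zero iff `specific` dominates `general` coordinate-wise, i.e. the
    more specific item (image) sits below the more abstract one (caption)
    in the partial order. The convention follows order embeddings and is
    an assumption here."""
    return torch.clamp(general - specific, min=0).pow(2).sum(dim=-1)

def pairwise_ranking_loss(img_emb, cap_emb, margin=0.05):
    """Matched image-caption pairs should incur a lower order violation
    than mismatched pairs, by at least `margin`."""
    pos = order_penalty(img_emb, cap_emb)                  # matched pairs
    neg = order_penalty(img_emb, cap_emb.roll(1, dims=0))  # mismatched pairs
    return torch.clamp(margin + pos - neg, min=0).mean()

# toy usage: non-negative embeddings, as is common for order embeddings
img_emb = torch.rand(8, 256)
cap_emb = torch.rand(8, 256)
loss = pairwise_ranking_loss(img_emb, cap_emb)
```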
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
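
A minimal sketch of the retrieval step described in the entry above: fetch the captions of the k most visually similar items from an external memory. The NumPy implementation, variable names, and plain cosine-similarity retriever are assumptions for illustration; the paper's models use a learned knowledge retriever component.

```python
import numpy as np

def retrieve_captions(query_feat, memory_feats, memory_caps, k=3):
    """Return captions of the k most visually similar memory items,
    using cosine similarity over L2-normalised features."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                       # (N,) similarity to each memory item
    topk = np.argsort(-sims)[:k]       # indices of the k nearest neighbours
    return [memory_caps[i] for i in topk]

# toy usage with a hypothetical 100-item memory of 512-d features
memory_feats = np.random.rand(100, 512)
memory_caps = [f"caption {i}" for i in range(100)]
context = retrieve_captions(np.random.rand(512), memory_feats, memory_caps)
```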
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Multi-modal reward for visual relationships-based image captioning [4.354364351426983]
This paper proposes a deep neural network architecture for image captioning based on fusing the visual relationships information extracted from an image's scene graph with the spatial feature maps of the image.
A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network using a combination of language and vision similarities in a common embedding space.
arXiv Detail & Related papers (2023-03-19T20:52:44Z)
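
A minimal sketch of such a multi-modal reward, assuming all items are already encoded into a common embedding space and that language and vision similarities are mixed with a simple weight alpha; the mixing scheme and encoders are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multimodal_reward(gen_emb, ref_emb, img_emb, alpha=0.5):
    """Reward = weighted mix of language similarity (generated caption
    vs. reference caption) and vision similarity (generated caption
    vs. image), all measured in one shared embedding space."""
    lang_sim = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    vis_sim = F.cosine_similarity(gen_emb, img_emb, dim=-1)
    return alpha * lang_sim + (1 - alpha) * vis_sim

# toy usage with random stand-ins for 512-d shared-space embeddings
r = multimodal_reward(torch.randn(512), torch.randn(512), torch.randn(512))
```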
- Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning [1.4337588659482516]
This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information.
We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features.
Our proposed SCFC outperforms various state-of-the-art image captioning models on popular metrics on the MSCOCO and Flickr30K datasets.
arXiv Detail & Related papers (2023-02-08T09:15:09Z)
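
A hedged single-block sketch of cross-modal feature consolidation: a compound query attends over concatenated visual and semantic features. The SCFC network stacks several such blocks; this module, its fusion strategy, and its dimensions are illustrative only.

```python
import torch
import torch.nn as nn

class CrossModalConsolidation(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query, visual_feats, semantic_feats):
        # consolidate: attend jointly over both modalities at once
        memory = torch.cat([visual_feats, semantic_feats], dim=1)
        out, _ = self.attn(query, memory, memory)
        return out + query  # residual connection

block = CrossModalConsolidation()
q = torch.randn(2, 1, 512)  # e.g. a decoder state used as the query
fused = block(q, torch.randn(2, 36, 512), torch.randn(2, 10, 512))
```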
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
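
A minimal sketch of integrating learned topics and visual features into the language model, assuming simple concatenation at each decoding step; the fusion scheme, class name, and dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TopicGuidedDecoder(nn.Module):
    def __init__(self, vocab=10000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTMCell(dim * 3, dim)  # word + topic + visual input
        self.out = nn.Linear(dim, vocab)

    def step(self, word_id, topic, visual, state):
        # condition each step on the topic vector and visual features
        x = torch.cat([self.embed(word_id), topic, visual], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h), (h, c)

dec = TopicGuidedDecoder()
state = (torch.zeros(1, 512), torch.zeros(1, 512))
logits, state = dec.step(torch.tensor([1]), torch.randn(1, 512),
                         torch.randn(1, 512), state)
```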
- Boost Image Captioning with Knowledge Reasoning [10.733743535624509]
We propose word attention to improve the correctness of visual attention when generating sequential descriptions word-by-word.
We introduce a new strategy to inject external knowledge extracted from a knowledge graph into the encoder-decoder framework to facilitate meaningful captioning.
arXiv Detail & Related papers (2020-11-02T12:19:46Z)
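
A minimal sketch of word attention as described in the entry above: previously generated word embeddings are re-weighted by their relevance to the current decoder state, yielding a context vector that can steer visual attention. The dot-product scoring function is an assumption.

```python
import torch
import torch.nn.functional as F

def word_attention(hidden, word_embs):
    """hidden: (B, D) current decoder state; word_embs: (B, T, D)
    embeddings of the words generated so far. Returns a (B, D)
    context vector summarising the salient words."""
    scores = torch.bmm(word_embs, hidden.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), word_embs).squeeze(1)     # (B, D)

ctx = word_attention(torch.randn(2, 512), torch.randn(2, 5, 512))
```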
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task that aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both the diversity and the amount of information it captures, leading to comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
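
A minimal sketch of the multi-task decoding idea from the entry above: a shared decoder state feeds two heads, one predicting the next word and one predicting an object/predicate tag. The class name, dimensions, and vocabulary sizes are made up for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, dim=512, vocab=10000, num_tags=3):
        super().__init__()
        self.word_head = nn.Linear(dim, vocab)    # next-word logits
        self.tag_head = nn.Linear(dim, num_tags)  # object / predicate / other

    def forward(self, hidden):
        # both tasks share the same decoder state
        return self.word_head(hidden), self.tag_head(hidden)

heads = MultiTaskHeads()
word_logits, tag_logits = heads(torch.randn(2, 512))
```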