Generating Descriptions for Sequential Images with Local-Object
Attention and Global Semantic Context Modelling
- URL: http://arxiv.org/abs/2012.01295v1
- Date: Wed, 2 Dec 2020 16:07:32 GMT
- Title: Generating Descriptions for Sequential Images with Local-Object
Attention and Global Semantic Context Modelling
- Authors: Jing Su, Chenghua Lin, Mian Zhou, Qingyun Dai, Haoyu Lv
- Abstract summary: We propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism.
We capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images.
A parallel LSTM network is used to decode the sequence of descriptions.
- Score: 5.362051433497476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an end-to-end CNN-LSTM model for generating
descriptions for sequential images with a local-object attention mechanism. To
generate coherent descriptions, we capture global semantic context using a
multi-layer perceptron, which learns the dependencies between sequential
images. A parallel LSTM network is used to decode the sequence of
descriptions. Experimental results show that our model outperforms the baseline
across three different evaluation metrics on the datasets published by
Microsoft.
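The abstract outlines three components: per-image CNN region features weighted by a local-object attention mechanism, a multi-layer perceptron that summarises the whole photo sequence into a global semantic context, and parallel LSTM decoders that produce one description per image. The following is a minimal PyTorch sketch of that pipeline; the module names, dimensions, and the mean-pooling used to feed the global MLP are illustrative assumptions, not the authors' released code.

```python
# Sketch of a CNN-LSTM captioner for image sequences with local-object
# attention and a global-context MLP. All design details below are assumed
# for illustration and do not reproduce the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalObjectAttention(nn.Module):
    """Attend over per-image region (object) features given the decoder state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, state):
        # regions: (batch, num_regions, feat_dim); state: (batch, hidden_dim)
        e = torch.tanh(self.feat_proj(regions) + self.state_proj(state).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=1)   # (batch, num_regions)
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)     # (batch, feat_dim)

class SequentialImageCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512,
                 hidden_dim=512, ctx_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = LocalObjectAttention(feat_dim, hidden_dim)
        # MLP that mixes pooled features of all images in the sequence into a
        # global semantic context vector (assumed pooling strategy).
        self.global_mlp = nn.Sequential(
            nn.Linear(feat_dim, ctx_dim), nn.ReLU(), nn.Linear(ctx_dim, ctx_dim))
        self.decoder = nn.LSTMCell(embed_dim + feat_dim + ctx_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (batch, seq_len, num_regions, feat_dim) -- CNN region features
        # captions:     (batch, seq_len, max_words)             -- token ids per image
        batch, seq_len = region_feats.shape[:2]
        # Global semantic context summarising the whole image sequence.
        global_ctx = self.global_mlp(region_feats.mean(dim=(1, 2)))  # (batch, ctx_dim)
        logits = []
        for i in range(seq_len):                       # one decoder pass per image
            h = region_feats.new_zeros(batch, self.decoder.hidden_size)
            c = torch.zeros_like(h)
            step_logits = []
            for t in range(captions.size(2) - 1):      # teacher forcing
                attended = self.attention(region_feats[:, i], h)
                word = self.embed(captions[:, i, t])
                h, c = self.decoder(
                    torch.cat([word, attended, global_ctx], dim=-1), (h, c))
                step_logits.append(self.out(h))
            logits.append(torch.stack(step_logits, dim=1))
        return torch.stack(logits, dim=1)  # (batch, seq_len, max_words-1, vocab)
```

In this sketch the per-image decoders share weights and run sequentially; the paper's "parallel LSTM network" could equally be realised as separate LSTM decoders, one per position in the image sequence.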
Related papers
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Towards Local Visual Modeling for Image Captioning [87.02744388237045]
We propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF).
LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors.
LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantic complementarity.
arXiv Detail & Related papers (2023-02-13T04:42:00Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Guiding Attention using Partial-Order Relationships for Image Captioning [2.620091916172863]
A guided attention network mechanism exploits the relationship between the visual scene and text-descriptions.
A pairwise ranking objective is used to train this embedding space, encouraging similar images, topics, and captions to lie close together in the shared semantic space.
Experimental results on the MSCOCO dataset show the competitiveness of our approach.
arXiv Detail & Related papers (2022-04-15T14:22:09Z) - Matching Visual Features to Hierarchical Semantic Topics for Image
Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Encoder Fusion Network with Co-Attention Embedding for Referring Image
Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z) - BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling [6.196023076311228]
We propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics.
We then employ a hierarchical LSTM network: the bottom LSTM takes the sentence vector representations from BERT as input and learns the dependencies between the sentences corresponding to the images, while the top LSTM generates the corresponding word vector representations.
Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr.
arXiv Detail & Related papers (2020-12-03T18:07:28Z) - Image Captioning with Compositional Neural Module Networks [18.27510863075184]
We introduce a hierarchical framework for image captioning that explores both compositionality and sequentiality of natural language.
Our algorithm learns to compose a detail-rich sentence by selectively attending to different modules corresponding to unique aspects of each object detected in an input image.
arXiv Detail & Related papers (2020-07-10T20:58:04Z)