Knowledge-enriched Attention Network with Group-wise Semantic for Visual
Storytelling
- URL: http://arxiv.org/abs/2203.05346v1
- Date: Thu, 10 Mar 2022 12:55:47 GMT
- Title: Knowledge-enriched Attention Network with Group-wise Semantic for Visual
Storytelling
- Authors: Tengpeng Li, Hanli Wang, Bin He, Chang Wen Chen
- Abstract summary: Visual storytelling aims at generating an imaginative and coherent story composed of multiple narrative sentences from a group of relevant images.
Existing methods often generate direct and rigid descriptions of apparent image-based content, because they are unable to explore implicit information beyond the images.
To address these problems, a novel knowledge-enriched attention network with group-wise semantic model is proposed.
- Score: 39.59158974352266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a technically challenging topic, visual storytelling aims at generating an
imaginative and coherent story, composed of multiple narrative sentences, from a
group of relevant images. Existing methods often generate direct and rigid
descriptions of apparent image-based content because they are unable to explore
implicit information beyond the images. Hence, these schemes cannot capture
consistent dependencies from a holistic representation, impairing the generation
of reasonable and fluent stories. To address these problems, a novel
knowledge-enriched attention network with group-wise semantic model is proposed,
built on three novel components whose practical advantages are verified by
substantial experiments. First, a knowledge-enriched attention network is
designed to extract implicit concepts from an external knowledge system, and
these concepts are fed into a cascade cross-modal attention mechanism to
characterize imaginative and concrete representations. Second, a group-wise
semantic module with second-order pooling is developed to extract globally
consistent guidance. Third, a unified one-stage story generation model with an
encoder-decoder structure is proposed, so that the knowledge-enriched attention
network, the group-wise semantic module and the multi-modal story generation
decoder are trained and applied jointly in an end-to-end fashion. Substantial
experiments on the popular Visual Storytelling dataset, with both objective and
subjective evaluation metrics, demonstrate the superior performance of the
proposed scheme compared with other state-of-the-art methods.
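To make the two core components more concrete, below is a minimal PyTorch-style sketch of how cross-modal attention over external knowledge concepts and a group-wise semantic module with second-order pooling could be wired together. This is an illustrative assumption based only on the abstract, not the authors' released implementation; the module names, dimensions and pooling details are hypothetical.

```python
import torch
import torch.nn as nn


class KnowledgeAttention(nn.Module):
    """Cross-modal attention: image region features (queries) attend to knowledge concepts."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, concept_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; concept_feats: (B, K, D) concept embeddings
        attended, _ = self.attn(img_feats, concept_feats, concept_feats)
        # Residual connection keeps the visual content while injecting knowledge cues.
        return self.norm(img_feats + attended)


class GroupSemantic(nn.Module):
    """Second-order pooling over all images of a story to obtain a global guidance vector."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim * dim, dim)

    def forward(self, group_feats: torch.Tensor) -> torch.Tensor:
        # group_feats: (B, N, D), one pooled feature per image in the group.
        # Second-order (bilinear) pooling: average outer product over the group.
        cov = torch.einsum("bnd,bne->bde", group_feats, group_feats) / group_feats.size(1)
        return self.proj(cov.flatten(1))  # (B, D) globally consistent guidance


if __name__ == "__main__":
    B, N, R, K, D = 2, 5, 36, 10, 256    # batch, images per story, regions, concepts, feature dim
    img = torch.randn(B * N, R, D)       # region features for every image
    concepts = torch.randn(B * N, K, D)  # embeddings of retrieved external-knowledge concepts
    enriched = KnowledgeAttention(D)(img, concepts)   # (B*N, R, D) knowledge-enriched features
    per_image = enriched.mean(dim=1).view(B, N, D)    # pool regions, regroup per story
    guidance = GroupSemantic(D)(per_image)            # (B, D) group-wise semantic guidance
    print(enriched.shape, guidance.shape)
```

In the full model described by the abstract, the enriched features and the guidance vector would further condition a multi-modal decoder that generates one sentence per image; that decoder is omitted from this sketch.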
Related papers
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
arXiv Detail & Related papers (2024-08-12T16:15:32Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- Hierarchical Aligned Multimodal Learning for NER on Tweet Posts [12.632808712127291]
Multimodal named entity recognition (MNER) has attracted increasing attention.
We propose a novel approach, which can dynamically align the image and text sequence.
We conduct experiments on two open datasets, and the results and detailed analysis demonstrate the advantage of our model.
arXiv Detail & Related papers (2023-05-15T06:14:36Z)
- Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Constellation: Learning relational abstractions over objects for compositional imagination [64.99658940906917]
We introduce Constellation, a network that learns relational abstractions of static visual scenes.
This work is a first step in the explicit representation of visual relationships and using them for complex cognitive procedures.
arXiv Detail & Related papers (2021-07-23T11:59:40Z)
- Bowtie Networks: Generative Modeling for Joint Few-Shot Recognition and Novel-View Synthesis [39.53519330457627]
We propose a novel task of joint few-shot recognition and novel-view synthesis.
We aim to simultaneously learn an object classifier and generate images of that type of object from new viewpoints.
We focus on the interaction and cooperation between a generative model and a discriminative model.
arXiv Detail & Related papers (2020-08-16T19:40:56Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.