SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
- URL: http://arxiv.org/abs/2402.00319v1
- Date: Thu, 1 Feb 2024 04:09:17 GMT
- Title: SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
- Authors: Eileen Wang, Soyeon Caren Han, Josiah Poon
- Abstract summary: This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations.
SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights.
The storyline is then extracted from this weighted story graph as a sequence of events using the Floyd-Warshall algorithm.
- Score: 12.560014305032437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike tasks such as image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense that put disjointed elements together into a coherent and engaging human-writeable story. However, most models focus mainly on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework that represents the image sequence as a graph of objects and relations, incorporating human action motivation and social interaction commonsense knowledge. SCO-VIST then treats this graph as a set of plot points and creates bridges between plot points with semantic and occurrence-based edge weights. The storyline is then derived from this weighted story graph as a sequence of events using the Floyd-Warshall algorithm. Our proposed framework produces stories that are superior across multiple metrics of visual grounding, coherence, diversity, and humanness, according to both automatic and human evaluations.
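For concreteness, here is a minimal sketch of the final step: given a small weighted story graph over plot points, the Floyd-Warshall all-pairs shortest-path algorithm (with successor tracking) recovers the lowest-cost path from the first to the last plot point as the storyline. The plot points and edge weights below are invented for illustration; this is not the authors' implementation.

```python
# Minimal sketch, NOT the paper's code: hypothetical plot points and
# edge weights illustrate how a weighted story graph plus Floyd-Warshall
# yields a storyline as the lowest-cost path from first to last node.
import math

plot_points = ["dog runs", "dog meets child", "child throws ball", "dog catches ball"]
n = len(plot_points)

# Lower weight = stronger semantic/occurrence link (weights are made up).
edges = {(0, 1): 0.4, (0, 2): 1.5, (1, 2): 0.3, (1, 3): 1.2, (2, 3): 0.5}

# Initialise distance and successor matrices.
dist = [[math.inf] * n for _ in range(n)]
nxt = [[None] * n for _ in range(n)]
for i in range(n):
    dist[i][i] = 0.0
for (u, v), w in edges.items():
    dist[u][v] = w
    nxt[u][v] = v

# Floyd-Warshall: relax every pair through every intermediate node k.
for k in range(n):
    for i in range(n):
        for j in range(n):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]
                nxt[i][j] = nxt[i][k]

# Reconstruct the lowest-cost path from the first to the last plot point.
path, node = [0], 0
while node != n - 1:
    node = nxt[node][n - 1]
    path.append(node)

print(" -> ".join(plot_points[i] for i in path))
# dog runs -> dog meets child -> child throws ball -> dog catches ball
```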
Related papers
- Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.
Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.
We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z)
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
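The "visual prefix" idea can be sketched generically: project image features into the language model's embedding space and prepend them to the token embeddings. The dimensions, module name, and single linear projection below are illustrative assumptions, not the paper's architecture.

```python
# Generic sketch of a visual prefix: image features are projected into
# the LM embedding space and prepended to token embeddings. Dimensions
# and names are assumptions for illustration only.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, img_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        # Map one feature vector per image to `prefix_len` LM embeddings.
        self.proj = nn.Linear(img_dim, prefix_len * lm_dim)
        self.prefix_len, self.lm_dim = prefix_len, lm_dim

    def forward(self, img_feats, token_embeds):
        # img_feats: (batch, n_images, img_dim), e.g. pooled visual features
        # token_embeds: (batch, n_tokens, lm_dim) from the LM embedding layer
        b, n, _ = img_feats.shape
        prefix = self.proj(img_feats).view(b, n * self.prefix_len, self.lm_dim)
        # The LM then attends over [visual prefix ; story tokens].
        return torch.cat([prefix, token_embeds], dim=1)

# Toy usage with random tensors standing in for real features.
model = VisualPrefix()
imgs = torch.randn(2, 5, 512)      # a 5-image sequence, batch of 2
tokens = torch.randn(2, 20, 768)   # 20 story-token embeddings
print(model(imgs, tokens).shape)   # torch.Size([2, 70, 768])
```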
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Incorporating Commonsense Knowledge into Story Ending Generation via Heterogeneous Graph Networks [16.360265861788253]
We propose a Story Heterogeneous Graph Network (SHGN) to explicitly model both the information of story context at different levels and the multi-grained interactive relations among them.
In detail, we consider commonsense knowledge, words and sentences as three types of nodes.
We design two auxiliary tasks to implicitly capture the sentiment trend and the key events that lie in the context.
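The three-node-type design can be illustrated with a toy heterogeneous graph. The sketch below uses PyTorch Geometric's HeteroData container; the features, edges, and relation names are placeholders, not the paper's construction.

```python
# Minimal sketch (not the paper's code) of a heterogeneous story graph
# with the three node types the summary names: commonsense concepts,
# words, and sentences. All features and edges are toy placeholders.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Toy node features: 3 commonsense concepts, 6 words, 2 sentences.
data["concept"].x = torch.randn(3, 16)
data["word"].x = torch.randn(6, 16)
data["sentence"].x = torch.randn(2, 16)

# Typed edges (source indices on row 0, target indices on row 1).
data["word", "in", "sentence"].edge_index = torch.tensor(
    [[0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1]]
)
data["concept", "enriches", "word"].edge_index = torch.tensor(
    [[0, 1, 2], [1, 3, 5]]
)

print(data)  # summarises node/edge types and their shapes
```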
arXiv Detail & Related papers (2022-01-29T09:33:11Z)
- Towards Coherent Visual Storytelling with Ordered Image Attention [73.422281039592]
We develop Ordered Image Attention (OIA) and Image-Sentence Attention (ISA).
OIA models interactions between the sentence-corresponding image and important regions in other images of the sequence.
To generate the story's sentences, we then highlight important image attention vectors with ISA.
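The underlying operation, attending from one image's features over regions of the other images, can be sketched as plain scaled dot-product attention. The shapes and single-head design below are assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch of attention over image regions, in the spirit of
# OIA. Shapes and the single-head design are assumptions, not the
# paper's formulation.
import torch
import torch.nn.functional as F

def region_attention(query_img, other_regions):
    # query_img: (d,) pooled feature of the sentence-aligned image
    # other_regions: (m, d) region features from the other images
    scores = other_regions @ query_img / query_img.shape[0] ** 0.5
    weights = F.softmax(scores, dim=0)   # importance per region
    return weights @ other_regions       # attended context vector

q = torch.randn(256)
regions = torch.randn(36, 256)   # e.g. 36 detected regions
context = region_attention(q, regions)
print(context.shape)             # torch.Size([256])
```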
arXiv Detail & Related papers (2021-08-04T17:12:39Z)
- Enhancing Social Relation Inference with Concise Interaction Graph and Discriminative Scene Representation [56.25878966006678]
We propose an approach for PRactical Inference in Social rElation (PRISE).
It concisely learns interactive features of persons and discriminative features of holistic scenes.
PRISE achieves a 6.8% improvement for domain classification on the PIPA dataset.
arXiv Detail & Related papers (2021-07-30T04:20:13Z)
- Plot and Rework: Modeling Storylines for Visual Storytelling [12.353812582863837]
This paper introduces PR-VIST, a framework that represents the input image sequence as a story graph and finds the best path through it to form a storyline.
PR-VIST learns to generate the final story via an iterative training process.
An ablation study shows that both plotting and reworking contribute to the model's superiority.
arXiv Detail & Related papers (2021-05-14T16:41:29Z)
- Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling [86.42719129731907]
We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our scheme of hide-and-tell, and the network design are indeed effective at storytelling.
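The "hide" part of the scheme amounts to masking a photo during training so the model must imagine the missing step. A toy sketch follows; the tensor shapes and the zero-masking choice are assumptions, not the paper's exact procedure.

```python
# Toy sketch of the "hide" step: randomly mask one photo's features so
# the model must imagine the missing part of the stream. Zero-masking
# and the shapes here are illustrative assumptions.
import torch

def hide_one_photo(img_feats):
    # img_feats: (n_images, d) features for one photo stream
    masked = img_feats.clone()
    drop = torch.randint(0, img_feats.shape[0], (1,)).item()
    masked[drop] = 0.0               # hide the chosen photo
    return masked, drop

feats = torch.randn(5, 512)
masked, drop = hide_one_photo(feats)
print(f"hid photo {drop}")
```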
arXiv Detail & Related papers (2020-02-03T14:22:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.