Album Storytelling with Iterative Story-aware Captioning and Large
Language Models
- URL: http://arxiv.org/abs/2305.12943v2
- Date: Wed, 24 May 2023 02:58:03 GMT
- Title: Album Storytelling with Iterative Story-aware Captioning and Large
Language Models
- Authors: Munan Ning, Yujia Xie, Dongdong Chen, Zeyin Song, Lu Yuan, Yonghong
Tian, Qixiang Ye, Li Yuan
- Abstract summary: We study how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling".
With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text.
Our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
- Score: 86.6548090965982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies how to transform an album into vivid and coherent stories, a
task we refer to as "album storytelling". While this task can help preserve
memories and facilitate experience sharing, it remains an underexplored area in
current literature. With recent advances in Large Language Models (LLMs), it is
now possible to generate lengthy, coherent text, opening up the opportunity to
develop an AI assistant for album storytelling. One natural approach is to use
caption models to describe each photo in the album, and then use LLMs to
summarize and rewrite the generated captions into an engaging story. However,
we find this often results in stories containing hallucinated information that
contradicts the images: because each caption is generated independently of the
story ("story-agnostic"), it may describe details irrelevant to the overall
narrative or miss necessary information. To address these limitations, we
propose a new iterative album
storytelling pipeline. Specifically, we start with an initial story and build a
story-aware caption model to refine the captions using the whole story as
guidance. The polished captions are then fed into the LLMs to generate a new
refined story. This process is repeated iteratively until the story contains
minimal factual errors while maintaining coherence. To evaluate our proposed
pipeline, we introduce a new dataset of image collections from vlogs and a set
of systematic evaluation metrics. Our results demonstrate that our method
effectively generates more accurate and engaging stories for albums, with
enhanced coherence and vividness.
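Read as an algorithm, the pipeline alternates between story-aware re-captioning and LLM-based story rewriting until the story's factual-error count is low enough. The sketch below is one minimal reading of that loop; the function names, their signatures, and the stopping rule are illustrative assumptions, not the authors' released interface.

```python
# Minimal sketch of the iterative album-storytelling loop from the abstract.
# caption_image, refine_story, and count_contradictions are hypothetical
# stand-ins for the paper's captioner, LLM rewriter, and error estimator.

def tell_album_story(images, caption_image, refine_story,
                     count_contradictions, max_rounds=5, tolerance=0):
    # Round 0: story-agnostic captions seed the initial story.
    captions = [caption_image(img, story=None) for img in images]
    story = refine_story(captions)

    for _ in range(max_rounds):
        if count_contradictions(story, images) <= tolerance:
            break  # few enough factual errors; keep the coherent story
        # Refine each caption with the whole current story as guidance,
        # then let the LLM rewrite the polished captions into a new story.
        captions = [caption_image(img, story=story) for img in images]
        story = refine_story(captions)
    return story
```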
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z)
- Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller [21.953766228135827]
We propose a new pipeline, termed LLaMS, to generate multimodal human-level stories.
First, we employ a sequence data auto-enhancement strategy to strengthen factual content expression.
Second, we propose an SQ-Adapter module for story illustration generation that maintains sequence consistency.
arXiv Detail & Related papers (2024-03-12T04:07:00Z)
- Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework that integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret (a minimal sketch of this pattern appears after the related papers list).
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on the novel yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences [67.61940880927708]
Current work on image-based story generation suffers from the fact that existing image sequence collections lack coherent plots behind them.
We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP).
VWP contains almost 2K selected sequences of movie shots, each including 5-10 images.
The image sequences are aligned with a total of 12K stories, collected via crowdsourcing given each image sequence and a set of grounded characters drawn from it.
arXiv Detail & Related papers (2023-01-20T13:38:24Z)
- On Narrative Information and the Distillation of Stories [4.224809458327516]
We show how modern artificial neural networks can be leveraged to distill stories.
We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative templates.
In the process of doing so, we give strong statistical evidence that these narrative information templates are present in existing albums.
arXiv Detail & Related papers (2022-11-22T17:30:36Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling [86.42719129731907]
We propose to explicitly learn to imagine a storyline that bridges the visual gap.
We train the network to produce a full, plausible story even with missing photo(s).
In experiments, we show that our hide-and-tell scheme and the network design are indeed effective at storytelling.
arXiv Detail & Related papers (2020-02-03T14:22:18Z)
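The "visual prefix" mentioned in the Question-Answer Plans entry above follows a common pattern: image features are projected into the language model's embedding space and prepended to the token embeddings. Below is a minimal PyTorch sketch of that pattern; the dimensions, prefix length, and mapper architecture are assumptions for illustration, not that paper's exact design.

```python
import torch
import torch.nn as nn

class VisualPrefixMapper(nn.Module):
    """Maps per-image features to a sequence of LM-readable embeddings.

    Generic sketch of the visual-prefix pattern; all sizes are assumed.
    """
    def __init__(self, image_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.mapper = nn.Sequential(
            nn.Linear(image_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, image_feats):
        # image_feats: (batch, n_images, image_dim)
        b, n, _ = image_feats.shape
        prefix = self.mapper(image_feats)  # (b, n, lm_dim * prefix_len)
        return prefix.view(b, n * self.prefix_len, self.lm_dim)

# Usage: prepend the visual prefix to the embedded story tokens, then run
# the language model on the concatenated sequence.
feats = torch.randn(2, 5, 512)                       # 5 images per album
prefix = VisualPrefixMapper()(feats)                 # -> (2, 50, 768)
token_embeds = torch.randn(2, 30, 768)               # embedded text tokens
lm_input = torch.cat([prefix, token_embeds], dim=1)  # -> (2, 80, 768)
```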
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.