Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal
Storyteller
- URL: http://arxiv.org/abs/2403.07301v1
- Date: Tue, 12 Mar 2024 04:07:00 GMT
- Title: Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal
Storyteller
- Authors: Chuanqi Zang, Jiji Tang, Rongsheng Zhang, Zeng Zhao, Tangjie Lv,
Mingtao Pei, Wei Liang
- Abstract summary: We propose a new pipeline, termed LLaMS, to generate multimodal human-level stories.
We first employ a sequence data auto-enhancement strategy to strengthen factual content expression.
Second, we propose the SQ-Adapter module for story illustration generation, which maintains sequence consistency.
- Score: 21.953766228135827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Storytelling aims to generate reasonable and vivid narratives based on an
ordered image stream. The fidelity to the image story theme and the divergence
of story plots attract readers to keep reading. Previous works iteratively
improved the alignment of multiple modalities but ultimately resulted in the
generation of simplistic storylines for image streams. In this work, we propose
a new pipeline, termed LLaMS, to generate multimodal, human-level stories that
are expressive and consistent. Specifically, by fully exploiting the commonsense
knowledge within the LLM, we first employ a sequence data auto-enhancement
strategy to strengthen factual content expression, and leverage a textual
reasoning architecture for expressive story generation and prediction. Second,
we propose the SQ-Adapter module for story illustration generation, which
maintains sequence consistency. We verify the superiority of the proposed LLaMS
through human evaluation. Evaluations show that LLaMS achieves state-of-the-art
storytelling performance, with win rates of 86% for correlation and 100% for
consistency compared with previous SOTA methods. Furthermore, ablation
experiments verify the effectiveness of the proposed sequence data enhancement
and SQ-Adapter.
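As a rough illustration of the three-stage pipeline the abstract describes (factual auto-enhancement, LLM-based narration, and SQ-Adapter-conditioned illustration), here is a minimal Python sketch. The names, interfaces, and control flow are assumptions for exposition only; the paper does not specify a public API, and the actual LLaMS implementation may differ substantially.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class StoryStep:
    image: Any                 # one frame of the ordered input image stream
    caption: str = ""          # factual content after auto-enhancement
    narration: str = ""        # expressive story text produced by the LLM
    illustration: Any = None   # regenerated, sequence-consistent image


def tell_story(
    images: List[Any],
    enhance: Callable[[Any], str],           # stage 1: factual caption enrichment (assumed interface)
    llm: Callable[[str], str],               # stage 2: textual reasoning / narration (assumed interface)
    sq_adapter: Callable[[List[Any]], Any],  # stage 3a: sequence-consistency features (assumed interface)
    painter: Callable[[str, Any], Any],      # stage 3b: text + condition -> image (assumed interface)
) -> List[StoryStep]:
    # Stage 1: enrich every frame with a factual description so the LLM
    # narrates concrete content rather than raw pixels.
    steps = [StoryStep(image=img, caption=enhance(img)) for img in images]

    # Stage 2: narrate sequentially so each step can build on the plot so far.
    context = ""
    for step in steps:
        prompt = f"{context}\nScene facts: {step.caption}\nContinue the story:"
        step.narration = llm(prompt)
        context += " " + step.narration

    # Stage 3: re-illustrate the story, conditioning each new frame on the
    # previously generated frames to keep characters and style consistent.
    history: List[Any] = []
    for step in steps:
        step.illustration = painter(step.narration, sq_adapter(history))
        history.append(step.illustration)
    return steps
```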
Related papers
- DataNarrative: Automated Data-Driven Storytelling with Visualizations and Texts [27.218934418961197]
We introduce a novel task for data story generation and a benchmark containing 1,449 stories from diverse sources.
To address the challenges of crafting coherent data stories, we propose a multiagent framework employing two LLM agents.
While our agentic framework generally outperforms non-agentic counterparts in both model-based and human evaluations, the results also reveal unique challenges in data story generation.
arXiv Detail & Related papers (2024-08-09T21:31:33Z) - SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 used for training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z) - Improving Visual Storytelling with Multimodal Large Language Models [1.325953054381901]
This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs).
We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements.
Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities.
arXiv Detail & Related papers (2024-07-02T18:13:55Z) - TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion
Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - Album Storytelling with Iterative Story-aware Captioning and Large
Language Models [86.6548090965982]
We study how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling".
With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text.
Our method effectively generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
arXiv Detail & Related papers (2023-05-22T11:45:10Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models, which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)