MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual
Storytelling via Multi-Layered Semantic-Aware Denoising
- URL: http://arxiv.org/abs/2312.10899v1
- Date: Mon, 18 Dec 2023 03:09:05 GMT
- Title: MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual
Storytelling via Multi-Layered Semantic-Aware Denoising
- Authors: Bingyuan Wang, Hengyu Meng, Zeyu Cai, Lanjiong Li, Yue Ma, Qifeng
Chen, Zeyu Wang
- Abstract summary: MagicScroll is a progressive diffusion-based image generation framework with a novel semantic-aware denoising process.
It enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions.
It showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience.
- Score: 42.20750912837316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual storytelling often uses nontypical aspect-ratio images like scroll
paintings, comic strips, and panoramas to create an expressive and compelling
narrative. While generative AI has achieved great success and shown the
potential to reshape the creative industry, it remains a challenge to generate
coherent and engaging content with arbitrary size and controllable style,
concept, and layout, all of which are essential for visual storytelling. To
overcome the shortcomings of previous methods including repetitive content,
style inconsistency, and lack of controllability, we propose MagicScroll, a
multi-layered, progressive diffusion-based image generation framework with a
novel semantic-aware denoising process. The model enables fine-grained control
over the generated image on object, scene, and background levels with text,
image, and layout conditions. We also establish the first benchmark for
nontypical aspect-ratio image generation for visual storytelling including
mediums like paintings, comics, and cinematic panoramas, with customized
metrics for systematic evaluation. Through comparative and ablation studies,
MagicScroll showcases promising results in aligning with the narrative text,
improving visual coherence, and engaging the audience. We plan to release the
code and benchmark in the hope of a better collaboration between AI researchers
and creative practitioners involving visual storytelling.
Related papers
- Imagining from Images with an AI Storytelling Tool [0.27309692684728604]
The proposed method explores the multimodal capabilities of GPT-4o to interpret visual content and create engaging stories.
The method is supported by a fully implemented tool, called ImageTeller, which accepts images from diverse sources as input.
arXiv Detail & Related papers (2024-08-21T10:49:15Z) - TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST)
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - AutoStory: Generating Diverse Storytelling Images with Minimal Human
Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z) - Visual Storytelling with Question-Answer Plans [70.89011289754863]
We present a novel framework which integrates visual representations with pretrained language models and planning.
Our model translates the image sequence into a visual prefix, a sequence of continuous embeddings which language models can interpret.
It also leverages a sequence of question-answer pairs as a blueprint plan for selecting salient visual concepts and determining how they should be assembled into a narrative.
arXiv Detail & Related papers (2023-10-08T21:45:34Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion
Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent character.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our method outperforms prior state-of-the-art in generating frames with high visual quality.
Our experiments for story generation on the MUGEN, the PororoSV and the FlintstonesSV dataset show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.