Multimodal Story Generation on Plural Images
- URL: http://arxiv.org/abs/2001.10980v2
- Date: Sat, 5 Jun 2021 10:44:37 GMT
- Title: Multimodal Story Generation on Plural Images
- Authors: Jing Jiang
- Abstract summary: We propose to use images as the input to a text generation model, called StoryGen.
The model demonstrates the ability to generate meaningful paragraphs of text containing features extracted from the input images.
- Score: 8.293936347234126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, text generation models take a sequence of text as input
and iteratively generate the next most probable word using pre-trained
parameters. In this work, we propose an architecture, called StoryGen, that uses
images instead of text as the input to the text generation model. Within the
architecture, we design a Relational Text Data Generator algorithm that relates
different features across multiple images. The output samples from the model
demonstrate the ability to generate meaningful paragraphs of text containing
features extracted from the input images. This is an undergraduate project
report, completed Dec. 2019 at the Cooper Union.
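The abstract gives only a high-level description of the pipeline. A minimal sketch of one plausible reading, in which per-image CNN features are related by attention and then decoded into text, could look like the following; all module names, dimensions, and the attention-based fusion are assumptions, not details from the paper.

```python
# Hypothetical sketch of the StoryGen pipeline described in the abstract:
# features extracted from each input image are related across images,
# then fed to a text decoder. Every design choice below is an assumption.
import torch
import torch.nn as nn

class RelationalTextDataGenerator(nn.Module):
    """Relates features from multiple images (sketch only); self-attention
    is one plausible way to realize the algorithm named in the abstract."""
    def __init__(self, feat_dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)

    def forward(self, image_feats):            # (batch, n_images, feat_dim)
        related, _ = self.attn(image_feats, image_feats, image_feats)
        return related                          # (batch, n_images, feat_dim)

class StoryGenSketch(nn.Module):
    def __init__(self, vocab_size=30000, feat_dim=512):
        super().__init__()
        self.encoder = nn.Linear(2048, feat_dim)   # stand-in for a CNN encoder
        self.relate = RelationalTextDataGenerator(feat_dim)
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, cnn_feats, max_len=50):      # (batch, n_images, 2048)
        ctx = self.relate(self.encoder(cnn_feats)) # related image features
        pooled = ctx.mean(dim=1, keepdim=True).repeat(1, max_len, 1)
        out, _ = self.decoder(pooled)
        return self.head(out)                      # per-step word logits

logits = StoryGenSketch()(torch.randn(2, 4, 2048))
print(logits.shape)  # torch.Size([2, 50, 30000])
```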
Related papers
- Style Generation: Image Synthesis based on Coarsely Matched Texts [10.939482612568433]
We introduce a novel task called text-based style generation and propose a two-stage generative adversarial network.
The first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature.
The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization.
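A hedged sketch of the two-stage idea only; the paper's actual architectures, dimensions, and losses are not given in this summary, so the modules below are illustrative stand-ins.

```python
# Stage 1 generates a coarse style image from a sentence feature;
# stage 2 refines it conditioned on a synthetic feature. All shapes
# and layers are assumptions for illustration.
import torch
import torch.nn as nn

sent_dim, noise_dim, img_ch = 256, 100, 3

# Stage 1: coarse 64x64 style image from a sentence feature plus noise.
stage1 = nn.Sequential(
    nn.Linear(sent_dim + noise_dim, 64 * 64 * img_ch), nn.Tanh())

# Stage 2: refine the coarse result, conditioned on a synthetic feature
# (here simply concatenated as an extra channel).
stage2 = nn.Sequential(
    nn.Conv2d(img_ch + 1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, img_ch, 3, padding=1), nn.Tanh())

sent_feat = torch.randn(1, sent_dim)
z = torch.randn(1, noise_dim)
coarse = stage1(torch.cat([sent_feat, z], dim=1)).view(1, img_ch, 64, 64)
synthetic_feat = torch.randn(1, 1, 64, 64)  # hypothetical synthetic feature map
refined = stage2(torch.cat([coarse, synthetic_feat], dim=1))
print(refined.shape)  # torch.Size([1, 3, 64, 64])
```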
arXiv Detail & Related papers (2023-09-08T21:51:11Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent characters.
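One plausible reading of the auto-regressive scheme, sketched below: each image is conditioned on the current sentence plus features of previously generated frames via an attention-based context module. All components are stand-ins, not the paper's actual modules.

```python
# Sketch of open-ended visual storytelling as autoregressive generation;
# the decoder is replaced by a random tensor since only the conditioning
# flow is being illustrated.
import torch
import torch.nn as nn

feat_dim = 512
context_module = nn.MultiheadAttention(feat_dim, 8, batch_first=True)

def generate_frame(sent_feat, prev_frame_feats):
    """Fuse the sentence feature with prior-frame features, then decode."""
    if prev_frame_feats:
        ctx = torch.stack(prev_frame_feats, dim=1)         # (1, n_prev, d)
        fused, _ = context_module(sent_feat[:, None], ctx, ctx)
        sent_feat = sent_feat + fused[:, 0]                # context-aware feature
    return torch.randn(1, 3, 64, 64), sent_feat           # frame + its feature

story = ["A fox found a key.", "It opened a tiny door."]
frames, feats = [], []
for sent in story:
    frame, feat = generate_frame(torch.randn(1, feat_dim), feats)
    frames.append(frame)
    feats.append(feat)  # later frames attend to earlier ones
```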
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
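A minimal sketch of the fusion idea: only small linear "mapper" layers are trained to translate between a frozen LLM's embedding space and a frozen image encoder/decoder's space. The dimensions and adapter names below are assumptions.

```python
# Trainable adapters bridge frozen models on either side; nothing in the
# large models themselves is updated. Dimensions are illustrative.
import torch
import torch.nn as nn

llm_dim, clip_dim = 4096, 768

img_to_llm = nn.Linear(clip_dim, llm_dim)  # image features -> LLM token space
llm_to_img = nn.Linear(llm_dim, clip_dim)  # LLM hidden states -> decoder space

image_feat = torch.randn(1, clip_dim)      # from a frozen image encoder
soft_token = img_to_llm(image_feat)        # injected as an LLM input embedding
llm_hidden = torch.randn(1, llm_dim)       # from the frozen LLM
decoder_cond = llm_to_img(llm_hidden)      # conditions a frozen image decoder
print(soft_token.shape, decoder_cond.shape)
```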
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also achieves significant improvements over recent diffusion models.
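The key idea is concrete enough to sketch: render target text as a glyph image, so that text generation can be posed as (diffusion-based) image generation. The rendering details below are illustrative, not the paper's exact procedure.

```python
# Render text onto a canvas; a diffusion model would be trained to
# generate such glyph images conditioned on the input.
from PIL import Image, ImageDraw

def text_to_glyph_image(text, size=(256, 64)):
    """Render text as a grayscale glyph image (default bitmap font)."""
    img = Image.new("L", size, color=255)           # white canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=0)  # black glyphs
    return img

glyph = text_to_glyph_image("Once upon a time")
glyph.save("glyph.png")
```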
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- GlyphDraw: Seamlessly Rendering Text with Intricate Spatial Structures in Text-to-Image Generation [18.396131717250793]
We introduce GlyphDraw, a general learning framework aiming to endow image generation models with the capacity to generate images coherently embedded with text for any specific language.
Our method not only produces accurate language characters as specified in the prompts, but also seamlessly blends the generated text into the background.
arXiv Detail & Related papers (2023-03-31T08:06:33Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
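A hedged sketch of the unified latent space: project text and subject-image features into one shared conditioning sequence for a diffusion model. The encoders, dimensions, and concatenation-based fusion are stand-ins; the paper's actual fusion is not specified in this summary.

```python
# Project both modalities into one latent space and concatenate into a
# single conditioning sequence for the diffusion U-Net (not shown).
import torch
import torch.nn as nn

text_dim, img_dim, latent_dim = 768, 1024, 768
text_proj = nn.Linear(text_dim, latent_dim)
img_proj = nn.Linear(img_dim, latent_dim)

text_tokens = torch.randn(1, 77, text_dim)   # e.g. CLIP-style text features
subject_feat = torch.randn(1, 1, img_dim)    # feature of the subject image

unified = torch.cat([text_proj(text_tokens), img_proj(subject_feat)], dim=1)
print(unified.shape)  # torch.Size([1, 78, 768])
```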
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
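The "same inference cost" claim follows from routing: only one expert runs per denoising step. A sketch of that routing, with interval boundaries and stand-in modules that are purely illustrative assumptions:

```python
# Different denoisers own different noise-level intervals of the reverse
# diffusion process; exactly one expert runs per step.
import torch
import torch.nn as nn

T = 1000
experts = {  # (t_low, t_high) -> denoiser specialized for that stage
    (667, 999): nn.Identity(),   # high-noise expert (stand-in module)
    (334, 666): nn.Identity(),   # mid-noise expert
    (0, 333):   nn.Identity(),   # low-noise expert
}

def denoise_step(x_t, t):
    """Route the sample to the expert owning timestep t, so per-step
    cost matches a single model."""
    for (lo, hi), expert in experts.items():
        if lo <= t <= hi:
            return expert(x_t)
    raise ValueError(f"no expert covers timestep {t}")

x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    x = denoise_step(x, t)
```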
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- CIGLI: Conditional Image Generation from Language & Image [5.159265382427163]
We propose a new task called CIGLI: Conditional Image Generation from Language and Image.
Instead of generating an image based on text as in text-image generation, this task requires the generation of an image from a textual description and an image prompt.
arXiv Detail & Related papers (2021-08-20T00:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.