Story Visualization by Online Text Augmentation with Context Memory
- URL: http://arxiv.org/abs/2308.07575v2
- Date: Sat, 19 Aug 2023 07:30:52 GMT
- Title: Story Visualization by Online Text Augmentation with Context Memory
- Authors: Daechul Ahn, Daneul Kim, Gwangmo Song, Seung Hwan Kim, Honglak Lee,
Dongyeop Kang, Jonghyun Choi
- Abstract summary: We propose a novel memory architecture for the Bi-directional Transformer framework with online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
- Score: 64.86944645907771
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Story visualization (SV) is a challenging text-to-image generation task
because it requires not only rendering visual details from the text descriptions
but also encoding a long-term context across multiple sentences. While prior
efforts mostly focus on generating a semantically relevant image for each
sentence, encoding the context spread across the given paragraph to generate
contextually convincing images (e.g., with the correct character or a proper
background for the scene) remains a challenge. To this end, we propose a novel
memory architecture for the Bi-directional Transformer framework with online
text augmentation that generates multiple pseudo-descriptions as supplementary
supervision during training, for better generalization to the language variation
encountered at inference. In extensive experiments on two popular SV benchmarks,
Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the
state of the art on various metrics including FID, character F1, frame accuracy,
BLEU-2/3, and R-precision, with similar or lower computational complexity.
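As a rough illustration of the training recipe described in the abstract, the sketch below pairs each ground-truth sentence with a few machine-generated pseudo-descriptions and averages the per-frame loss over all of them. The `make_pseudo_descriptions` helper (a random word-drop stand-in for a real paraphraser) and the `generate`/`loss_fn` callables are hypothetical placeholders, not the authors' implementation.

```python
import random
from typing import Callable, List

def make_pseudo_descriptions(sentence: str, k: int = 3) -> List[str]:
    """Hypothetical stand-in for the online paraphraser: mimic language
    variation by randomly dropping words. In practice a pretrained
    paraphrase model would generate the pseudo-descriptions."""
    words = sentence.split()
    variants = []
    for _ in range(k):
        kept = [w for w in words if random.random() > 0.15]
        variants.append(" ".join(kept) if kept else sentence)
    return variants

def story_training_loss(
    story: List[str],
    generate: Callable[[str, int], object],   # text + frame index -> generated frame
    loss_fn: Callable[[object, int], float],  # generated frame + frame index -> loss
) -> float:
    """Average the per-frame loss over the original sentence and its
    pseudo-descriptions, so each target frame receives supplementary
    supervision from several textual variants."""
    total = 0.0
    for t, sentence in enumerate(story):
        texts = [sentence] + make_pseudo_descriptions(sentence)
        frame_losses = [loss_fn(generate(txt, t), t) for txt in texts]
        total += sum(frame_losses) / len(frame_losses)
    return total / len(story)

# Toy usage with dummy callables (a real SV generator would replace these):
story = ["Pororo waves at Crong.", "Crong runs to the snowy hill."]
dummy_generate = lambda text, t: len(text)
dummy_loss = lambda frame, t: float(frame) * 0.01
print(story_training_loss(story, dummy_generate, dummy_loss))
```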
Related papers
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, together with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance [17.251982243534144]
LAR-Gen is a novel approach that enables seamless inpainting of masked scene images.
Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence.
Experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z)
- Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task that requires both visual quality and consistency across frames in the generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
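As a loose sketch of the cross-attention idea summarized above, the block below lets the image tokens of one frame attend over the embeddings of all captions in the story (past, current, and future). It assumes PyTorch; the module, dimensions, and token layout are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CaptionContextAttention(nn.Module):
    """Illustrative cross-attention block: image tokens of the current frame
    attend over the embeddings of every caption in the story, so each frame
    is generated with full-story context."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor, caption_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens:   (batch, n_image_tokens, dim)
        # caption_tokens: (batch, n_captions * n_text_tokens, dim)
        ctx, _ = self.attn(query=frame_tokens, key=caption_tokens, value=caption_tokens)
        return self.norm(frame_tokens + ctx)

# Toy usage: 2 stories, 16 image tokens per frame, 5 captions x 8 tokens each.
block = CaptionContextAttention()
frames = torch.randn(2, 16, 256)
captions = torch.randn(2, 5 * 8, 256)
print(block(frames, captions).shape)  # torch.Size([2, 16, 256])
```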
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
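A minimal sketch of the concatenation idea as summarized above: several image-text pairs are strung together into one pseudo video-paragraph sample. The data layout and the `concatenate_samples` helper are assumptions for illustration, not COSA's actual pipeline.

```python
import random
from typing import List

def concatenate_samples(pairs: List[dict], group_size: int = 3) -> List[dict]:
    """Group image-text pairs into pseudo video-paragraph samples: the images
    act as 'frames' and the captions are joined into a 'paragraph'.
    (Illustrative only; a real corpus would store image tensors, not paths.)"""
    random.shuffle(pairs)
    samples = []
    for i in range(0, len(pairs) - group_size + 1, group_size):
        group = pairs[i:i + group_size]
        samples.append({
            "frames": [p["image"] for p in group],
            "paragraph": " ".join(p["caption"] for p in group),
        })
    return samples

# Toy usage on a tiny synthetic image-text corpus.
corpus = [{"image": f"img_{i}.jpg", "caption": f"caption {i}."} for i in range(7)]
for sample in concatenate_samples(corpus):
    print(sample["frames"], "->", sample["paragraph"])
```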
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition [10.130342722193204]
This paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER).
TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings.
Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring.
arXiv Detail & Related papers (2023-02-28T02:37:30Z)
- Character-Centric Story Visualization via Visual Planning and Token Alignment [53.44760407148918]
Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is preserving the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework, Semantic-Spatial Aware GAN, which is trained end-to-end so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
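As a loose illustration of the module described above, the sketch below projects visual features into the caption decoder's semantic space and scores them with a simple stand-in for the modality loss. It assumes PyTorch; the layer sizes and the cosine-distance loss are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    """Illustrative modality-transition layer: projects CNN image features
    into the word-embedding space consumed by the caption decoder."""
    def __init__(self, visual_dim: int = 2048, semantic_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)

def modality_loss(projected: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Simple stand-in for the modality loss: pull the projected visual
    features toward ground-truth sentence features (cosine distance)."""
    return (1 - F.cosine_similarity(projected, text_feats, dim=-1)).mean()

# Toy usage with random features in place of real CNN/sentence encoders.
mtm = ModalityTransition()
visual = torch.randn(4, 2048)   # e.g., pooled CNN features
text = torch.randn(4, 512)      # e.g., sentence embeddings
print(modality_loss(mtm(visual), text).item())
```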
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.