Text-Only Training for Visual Storytelling
- URL: http://arxiv.org/abs/2308.08881v1
- Date: Thu, 17 Aug 2023 09:32:17 GMT
- Title: Text-Only Training for Visual Storytelling
- Authors: Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li
- Abstract summary: We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
- Score: 107.19873669536523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual storytelling aims to generate a narrative based on a sequence of
images, necessitating both vision-language alignment and coherent story
generation. Most existing solutions predominantly depend on paired image-text
training data, which can be costly to collect and challenging to scale. To
address this, we formulate visual storytelling as a visual-conditioned story
generation problem and propose a text-only training method that separates the
learning of cross-modality alignment and story generation. Our approach
specifically leverages the cross-modality pre-trained CLIP model to integrate
visual control into a story generator, trained exclusively on text data.
Moreover, we devise a training-free visual condition planner that accounts for
the temporal structure of the input image sequence while balancing global and
local visual content. The distinctive advantage of requiring only text data for
training enables our method to learn from external text story data, enhancing
the generalization capability of visual storytelling. We conduct extensive
experiments on the VIST benchmark, showcasing the effectiveness of our approach
in both in-domain and cross-domain settings. Further evaluations on expression
diversity and human assessment underscore the superiority of our method in
terms of informativeness and robustness.
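As a concrete illustration of the training/inference split described in the abstract, below is a minimal sketch, not the authors' released code: the story generator would be conditioned on CLIP text embeddings of story sentences during text-only training, while CLIP image embeddings of the input photo sequence are substituted at inference, relying on CLIP's shared embedding space. A simple training-free planner then mixes each local (per-image) embedding with the global sequence mean. It assumes OpenAI's `clip` package; the function names and the mixing weight `alpha` are illustrative, not taken from the paper.

```python
# Minimal sketch of text-only training with CLIP-based visual conditioning.
# Assumption: OpenAI CLIP (https://github.com/openai/CLIP) provides the shared
# text/image embedding space; the story generator itself is omitted here.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def text_conditions(story_sentences):
    """Training-time conditions: CLIP text embeddings of gold story sentences."""
    tokens = clip.tokenize(story_sentences, truncate=True).to(device)
    feats = clip_model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def image_conditions(pil_images):
    """Inference-time conditions: CLIP image embeddings of the photo sequence."""
    batch = torch.stack([preprocess(im) for im in pil_images]).to(device)
    feats = clip_model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def plan_visual_conditions(image_feats, alpha=0.5):
    """Hypothetical training-free planner: for each time step, blend the local
    per-image embedding with the global mean of the whole sequence, keeping the
    temporal order of the inputs intact."""
    global_feat = image_feats.mean(dim=0, keepdim=True)        # (1, d)
    mixed = alpha * image_feats + (1.0 - alpha) * global_feat  # (T, d)
    return mixed / mixed.norm(dim=-1, keepdim=True)
```

In this reading, the generator only ever sees condition vectors from the CLIP space, so it can be trained purely on text-story corpora and still accept image-derived conditions at test time; `alpha` controls the balance between local image detail and global narrative context.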
Related papers
- Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning [2.401993998791928]
We propose a framework that trains a lightweight vision-language mapping network to connect modalities.
We introduce a multimodal contrastive objective that also improves visual relevance and story informativeness.
arXiv Detail & Related papers (2024-08-12T16:15:32Z) - LEGO: Self-Supervised Representation Learning for Scene Text Images [32.21085469233465]
We propose a Local Explicit and Global Order-aware self-supervised representation learning method for scene text images.
Inspired by the human cognitive process of learning words, we propose three novel pretext tasks for LEGO to model sequential, semantic, and structural features.
The LEGO recognizer achieves superior or comparable performance compared to state-of-the-art scene text recognition methods on six benchmarks.
arXiv Detail & Related papers (2024-08-04T14:07:14Z) - Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z) - TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)