Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling
- URL: http://arxiv.org/abs/2008.04504v1
- Date: Tue, 11 Aug 2020 03:55:11 GMT
- Title: Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling
- Authors: Jiacheng Li, Siliang Tang, Juncheng Li, Jun Xiao, Fei Wu, Shiliang Pu,
Yueting Zhuang
- Abstract summary: We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding structure mutually bring benefit to the few-shot model.
- Score: 81.33107307509718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Storytelling (VIST) is a task to tell a narrative story about a
certain topic according to the given photo stream. The existing studies focus
on designing complex models, which rely on a huge amount of human-annotated
data. However, the annotation of VIST is extremely costly and many topics
cannot be covered in the training dataset due to the long-tail topic
distribution. In this paper, we focus on enhancing the generalization ability
of the VIST model by considering the few-shot setting. Inspired by the way
humans tell a story, we propose a topic adaptive storyteller to model the
ability of inter-topic generalization. In practice, we apply the gradient-based
meta-learning algorithm on multi-modal seq2seq models to endow the model with the
ability to adapt quickly from topic to topic. Besides, we further propose a
prototype encoding structure to model the ability of intra-topic derivation.
Specifically, we encode and restore the few training story texts to serve as a
reference to guide the generation at inference time. Experimental results show
that topic adaptation and prototype encoding structure mutually bring benefit
to the few-shot model on the BLEU and METEOR metrics. The further case study shows
that the stories generated after few-shot adaptation are more relevant and
expressive.
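As a rough illustration of the gradient-based meta-learning idea mentioned in the abstract, the sketch below runs a first-order MAML-style loop on a toy one-parameter regression problem. It is not the authors' model: the multi-modal seq2seq storyteller is replaced by a single weight, each "topic" is a hypothetical random slope, and all function names, learning rates, and step counts are assumptions chosen for illustration.

```python
import random


def mse_grad(w, data):
    """Gradient of the mean squared error for the 1-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def adapt(w, support, inner_lr=0.1, steps=1):
    """Inner loop: a few gradient steps on the few-shot support set of one topic."""
    for _ in range(steps):
        w = w - inner_lr * mse_grad(w, support)
    return w


def sample_task(rng, n=5):
    """Hypothetical 'topic': a random slope with n support and n query examples."""
    slope = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(2 * n)]
    points = [(x, slope * x) for x in xs]
    return points[:n], points[n:]


def meta_train(meta_steps=2000, meta_lr=0.01, seed=0):
    """Outer loop: learn an initial weight that adapts quickly to a new topic."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(meta_steps):
        support, query = sample_task(rng)
        w_adapted = adapt(w, support)
        # First-order approximation: apply the query-set gradient, evaluated
        # at the adapted weight, directly to the shared initial weight.
        w = w - meta_lr * mse_grad(w_adapted, query)
    return w


if __name__ == "__main__":
    w_init = meta_train()
    print("meta-learned initial weight:", w_init)
```

In this toy setting the meta-learned weight should settle near the middle of the slope range, so that a single inner step on a handful of examples moves it toward any particular topic.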
Related papers
- TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling [14.15543866199545]
As a cross-modal task, visual storytelling aims to generate a story for an ordered image sequence automatically.
We propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST).
In particular, we pre-extracted the topic information of stories from both visual and linguistic perspectives.
arXiv Detail & Related papers (2024-03-18T08:01:23Z) - Controllable Topic-Focused Abstractive Summarization [57.8015120583044]
Controlled abstractive summarization focuses on producing condensed versions of a source article to cover specific aspects.
This paper presents a new Transformer-based architecture capable of producing topic-focused summaries.
arXiv Detail & Related papers (2023-11-12T03:51:38Z) - Let the Pretrained Language Models "Imagine" for Short Texts Topic
Modeling [29.87929724277381]
In short texts, co-occurrence information is minimal, which results in feature sparsity in document representation.
Existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics.
We extend short text into longer sequences using existing pre-trained language models (PLMs).
arXiv Detail & Related papers (2023-10-24T00:23:30Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - TopNet: Learning from Neural Topic Model to Generate Long Stories [43.5564336855688]
Long story generation (LSG) is one of the coveted goals in natural language processing.
We propose TopNet to obtain high-quality skeleton words to complement the short input.
Our proposed framework is highly effective in skeleton word selection and significantly outperforms state-of-the-art models in both automatic evaluation and human evaluation.
arXiv Detail & Related papers (2021-12-14T09:47:53Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Unsupervised Graph-based Topic Modeling from Video Transcriptions [5.210353244951637]
We develop a topic extractor on video transcriptions using neural word embeddings and a graph-based clustering method.
Experimental results on the real-life multimodal data set MuSe-CaR demonstrate that our approach extracts coherent and meaningful topics.
arXiv Detail & Related papers (2021-05-04T12:48:17Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)