Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for
Stylized Image Captioning
- URL: http://arxiv.org/abs/2108.11912v1
- Date: Thu, 26 Aug 2021 17:08:58 GMT
- Title: Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for
Stylized Image Captioning
- Authors: Guodun Li, Yuchen Zhai, Zehao Lin, Yin Zhang
- Abstract summary: Stylized image captioning systems aim to generate a caption consistent with a given style description.
Many studies focus on unsupervised approaches, without considering the problem from the perspective of data augmentation.
We propose a novel Extract-Retrieve-Generate data augmentation framework to extract style phrases from small-scale stylized sentences and graft them onto large-scale factual captions.
- Score: 3.0415487485299373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stylized image captioning systems aim to generate a caption not only
semantically related to a given image but also consistent with a given style
description. One of the biggest challenges with this task is the lack of
sufficient paired stylized data. Many studies focus on unsupervised approaches,
without considering the problem from the perspective of data augmentation. We begin with
the observation that people may recall similar emotions when they are in
similar scenes, and often express similar emotions with similar style phrases,
which underpins our data augmentation idea. In this paper, we propose a novel
Extract-Retrieve-Generate data augmentation framework to extract style phrases
from small-scale stylized sentences and graft them to large-scale factual
captions. First, we design the emotional signal extractor to extract style
phrases from small-scale stylized sentences. Second, we construct the pluggable
multi-modal scene retriever to retrieve scenes, each represented as a pair of an
image and its stylized caption, that are similar to the query image or caption
from the large-scale factual data. Finally, based on the style phrases of
similar scenes and the factual description of the current scene, we build the
emotion-aware caption generator to generate fluent and diversified stylized
captions for the current scene. Extensive experimental results show that our
framework can alleviate the data scarcity problem effectively. It also
significantly boosts the performance of several existing image captioning
models in both supervised and unsupervised settings, outperforming the
state-of-the-art stylized image captioning methods in terms of both sentence
relevance and stylishness by a substantial margin.
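The abstract outlines a three-step Extract-Retrieve-Generate pipeline. The snippet below is a minimal, text-only sketch of that data flow, not the paper's implementation: the keyword list, the TF-IDF retriever, and the phrase-appending step are toy stand-ins for the learned emotional signal extractor, the multi-modal scene retriever, and the emotion-aware caption generator, and all captions shown are invented examples.
```python
# Toy sketch of Extract-Retrieve-Generate data augmentation (assumptions noted above).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small-scale stylized captions (invented examples, not from the paper's datasets).
stylized = [
    "a lonely dog waits sadly by the empty road",
    "two happy children laugh joyfully on the sunny beach",
]

# Large-scale factual captions to be augmented (also invented examples).
factual = [
    "a dog sits by the road",
    "children play on the beach",
]

# Step 1 (Extract): pull out candidate style phrases. A crude keyword filter
# stands in for the learned emotional signal extractor.
STYLE_WORDS = {"lonely", "sadly", "happy", "joyfully", "empty", "sunny"}
def extract_style_phrases(caption):
    return [w for w in caption.split() if w in STYLE_WORDS]

# Step 2 (Retrieve): for each factual caption, find the most similar stylized
# scene. TF-IDF cosine similarity over captions stands in for the paper's
# pluggable multi-modal (image + text) scene retriever.
vec = TfidfVectorizer().fit(stylized + factual)
sims = cosine_similarity(vec.transform(factual), vec.transform(stylized))

# Step 3 (Generate): graft the retrieved style phrases onto the factual
# caption. The paper's emotion-aware caption generator produces a fluent,
# rewritten sentence; appending phrases here is only a placeholder.
for i, caption in enumerate(factual):
    nearest = sims[i].argmax()
    phrases = extract_style_phrases(stylized[nearest])
    print(caption + " " + " ".join(phrases))
```
In the paper the retriever can query with the image as well as the caption, and the generator rewrites the sentence rather than concatenating phrases; the sketch only makes the augmentation data flow concrete.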
Related papers
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- Image Captioning with Multi-Context Synthetic Data [16.961112970612447]
Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
arXiv Detail & Related papers (2023-05-29T13:18:59Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting: generating a realistic image from objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images, and captions implicitly describe their rich attributes and connections.
MOC-GAN is proposed to mix the inputs of the two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Towards Accurate Text-based Image Captioning with Content Diversity Exploration [46.061291298616354]
Text-based image captioning (TextCap), which aims to read and reason about text in images, is crucial for a machine to understand a detailed and complex scene environment.
Existing methods attempt to extend traditional image captioning methods to this task, focusing on describing the overall scene of an image with one global caption.
This is infeasible because the complex textual and visual information cannot be described well within one caption.
arXiv Detail & Related papers (2021-04-23T08:57:47Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)