Visual Captioning at Will: Describing Images and Videos Guided by a Few
Stylized Sentences
- URL: http://arxiv.org/abs/2307.16399v1
- Date: Mon, 31 Jul 2023 04:26:01 GMT
- Title: Visual Captioning at Will: Describing Images and Videos Guided by a Few
Stylized Sentences
- Authors: Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin
Jin
- Abstract summary: Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
- Score: 49.66987347397398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stylized visual captioning aims to generate image or video descriptions with
specific styles, making them more attractive and emotionally appropriate. One
major challenge with this task is the lack of paired stylized captions for
visual content, so most existing works focus on unsupervised methods that do
not rely on parallel datasets. However, these approaches still require training
with sufficient examples that have style labels, and the generated captions are
limited to predefined styles. To address these limitations, we explore the
problem of Few-Shot Stylized Visual Captioning, which aims to generate captions
in any desired style, using only a few examples as guidance during inference,
without requiring further training. We propose a framework called FS-StyleCap
for this task, which utilizes a conditional encoder-decoder language model and
a visual projection module. Our two-step training scheme proceeds as follows:
first, we train a style extractor to generate style representations on an
unlabeled text-only corpus. Then, we freeze the extractor and enable our
decoder to generate stylized descriptions based on the extracted style vector
and projected visual content vectors. During inference, our model can generate
desired stylized captions by deriving the style representation from
user-supplied examples. Our automatic evaluation results for few-shot
sentimental visual captioning outperform state-of-the-art approaches and are
comparable to models that are fully trained on labeled style corpora. Human
evaluations further confirm our model's ability to handle multiple styles.
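The following is a minimal, illustrative sketch (not the authors' released code) of how the two-step pipeline described above could be wired at inference time: a frozen style extractor pools a style vector from a few user-supplied sentences, a visual projection module maps visual features into the decoder's embedding space, and the two are concatenated as the decoder's conditioning prefix. All module names, dimensions, and the mean-pooling choice are assumptions for illustration.

```python
# Illustrative sketch of the few-shot stylized captioning flow described in the
# abstract. All names, shapes, and the mean-pooling of few-shot style vectors
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class StyleExtractor(nn.Module):
    """Step 1: trained on an unlabeled text-only corpus, then frozen."""

    def __init__(self, vocab_size=30522, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, token_ids):                     # (batch, seq_len)
        hidden, _ = self.encoder(self.embed(token_ids))
        return hidden.mean(dim=1)                     # one style vector per sentence


class VisualProjector(nn.Module):
    """Maps frozen visual features into the decoder's embedding space."""

    def __init__(self, d_visual=768, d_model=512, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_model * n_tokens)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, visual_feats):                  # (batch, d_visual)
        return self.proj(visual_feats).view(-1, self.n_tokens, self.d_model)


def few_shot_style_vector(extractor, example_token_ids):
    """Derive one style representation from a few user-supplied stylized sentences."""
    with torch.no_grad():                             # extractor stays frozen
        per_sentence = extractor(example_token_ids)   # (k, d_model)
    return per_sentence.mean(dim=0, keepdim=True)     # pooled style vector


# Step 2 / inference: condition the decoder on [style vector; projected visual tokens].
extractor, projector = StyleExtractor(), VisualProjector()
style_vec = few_shot_style_vector(extractor, torch.randint(0, 30522, (3, 20)))
visual_tokens = projector(torch.randn(1, 768))
decoder_prefix = torch.cat([style_vec.unsqueeze(1), visual_tokens], dim=1)
print(decoder_prefix.shape)                           # (1, 1 + n_tokens, d_model)
```

The point the sketch tries to make concrete is that only the few example sentences change at inference; no additional training is needed to switch styles.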
Related papers
- StyleBrush: Style Extraction and Transfer from a Single Image [19.652575295703485]
Stylization for visual content aims to add specific style patterns at the pixel level while preserving the original structural features.
We propose StyleBrush, a method that accurately captures styles from a reference image and "brushes" the extracted style onto other input visual content.
arXiv Detail & Related papers (2024-08-18T14:27:20Z)
- Say Anything with Any Style [9.50806457742173]
Say Anything with Any Style queries the discrete style representation via a generative model with a learned style codebook.
Our approach surpasses state-of-the-art methods in terms of both lip-synchronization and stylized expression. A toy sketch of a style-codebook lookup follows this entry.
arXiv Detail & Related papers (2024-03-11T01:20:03Z)
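As a rough illustration of the style-codebook query mentioned in the Say Anything with Any Style summary above, the snippet below performs a vector-quantization-style nearest-neighbour lookup against a learned codebook of style embeddings. Codebook size, dimensionality, and the nearest-neighbour query are illustrative assumptions, not the paper's actual model.

```python
# Illustrative vector-quantization-style lookup against a learned style codebook.
# The codebook size, dimensions, and nearest-neighbour query are assumptions.
import torch
import torch.nn as nn


class StyleCodebook(nn.Module):
    def __init__(self, num_codes=64, d_style=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, d_style)   # learned style entries

    def query(self, style_query):                       # (batch, d_style)
        # Pick the closest discrete style code for each query vector.
        dists = torch.cdist(style_query, self.codes.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=1)
        return self.codes(idx), idx                     # quantized style, code index


codebook = StyleCodebook()
quantized, idx = codebook.query(torch.randn(2, 256))
print(quantized.shape, idx.tolist())
```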
- StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter [78.75422651890776]
StyleCrafter is a generic method that enhances pre-trained T2V models with a style control adapter.
To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image.
StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images.
arXiv Detail & Related papers (2023-12-01T03:53:21Z)
- StyleAdapter: A Unified Stylized Image Generation Model [97.24936247688824]
StyleAdapter is a unified stylized image generation model capable of producing a variety of stylized images.
It can be integrated with existing controllable synthesis methods, such as T2I-adapter and ControlNet.
arXiv Detail & Related papers (2023-09-04T19:16:46Z)
- ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora [37.53634609063878]
We propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap).
A conditional variational auto-encoder is then used to automatically capture diverse stylistic patterns in latent space.
Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance in terms of consistency with the image, style accuracy, and diversity. A toy sketch of conditional-VAE latent sampling follows this entry.
arXiv Detail & Related papers (2023-08-02T13:33:20Z)
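To make the ADS-Cap summary above a bit more concrete, here is a toy sketch of the conditional-VAE idea: conditioning on the same image feature while drawing different latent samples is what produces diverse stylistic variants. The shapes, module names, and the simple reparameterization are assumptions, not ADS-Cap's implementation.

```python
# Toy conditional-VAE sampling: the same image condition combined with different
# latent samples yields diverse decoder inputs. Shapes and names are illustrative.
import torch
import torch.nn as nn


class ConditionalLatentSampler(nn.Module):
    def __init__(self, d_image=512, d_latent=64, d_out=512):
        super().__init__()
        self.prior = nn.Linear(d_image, 2 * d_latent)        # predicts mu and log-var
        self.decode_in = nn.Linear(d_image + d_latent, d_out)

    def sample(self, image_feat, n_samples=3):
        mu, logvar = self.prior(image_feat).chunk(2, dim=-1)
        variants = []
        for _ in range(n_samples):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
            variants.append(self.decode_in(torch.cat([image_feat, z], dim=-1)))
        return torch.stack(variants, dim=1)   # (batch, n_samples, d_out)


sampler = ConditionalLatentSampler()
diverse_inputs = sampler.sample(torch.randn(1, 512))
print(diverse_inputs.shape)   # (1, 3, 512): one decoding seed per stylistic variant
```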
- StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model [64.26721402514957]
We propose StylerDALLE, a style transfer method that uses natural language to describe abstract art styles.
Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation.
To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision. A minimal sketch of such a CLIP-based reward follows this entry.
arXiv Detail & Related papers (2023-03-16T12:44:44Z)
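For the StylerDALLE summary above, the snippet below sketches the kind of CLIP-based reward an RL strategy with language supervision can use: cosine similarity between the generated image's embedding and the embedding of the natural-language style description. How the embeddings are produced (e.g., by a pretrained CLIP model) is abstracted away, and the function name is hypothetical.

```python
# Minimal sketch of a CLIP-style reward for language-guided style transfer:
# reward = cosine similarity between the generated image embedding and the
# embedding of the natural-language style description. Producing the embeddings
# (e.g. with a pretrained CLIP model) is abstracted away here.
import torch
import torch.nn.functional as F


def style_reward(image_emb: torch.Tensor, style_text_emb: torch.Tensor) -> torch.Tensor:
    """Higher when the generated image matches the described style."""
    image_emb = F.normalize(image_emb, dim=-1)
    style_text_emb = F.normalize(style_text_emb, dim=-1)
    return (image_emb * style_text_emb).sum(dim=-1)   # per-sample cosine similarity


# Usage with placeholder embeddings at CLIP's typical 512-d size.
reward = style_reward(torch.randn(4, 512), torch.randn(4, 512))
print(reward.shape)   # (4,): one scalar reward per generated sample
```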
- Controllable Image Captioning via Prompting [9.935191668056463]
We show that a unified model can perform well in diverse domains and freely switch among multiple styles.
To be specific, we design a set of prompts to fine-tune the pre-trained image captioner.
In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. A minimal prompt-selection sketch follows this entry.
arXiv Detail & Related papers (2022-12-04T11:59:31Z)
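The prompt-switching idea in the Controllable Image Captioning via Prompting summary above can be illustrated with a short sketch: a fixed set of style prompts is defined once, and the chosen prompt is handed to a prompt-tuned captioner at inference. The prompt texts and the `captioner` callable are placeholders, not the paper's actual prompts or model.

```python
# Illustrative prompt selection for style-controllable captioning: a fixed set of
# style prompts is defined up front, and switching styles at inference means
# switching prompts. The prompt texts and the `captioner` callable are placeholders.
STYLE_PROMPTS = {
    "factual":  "Describe the image:",
    "humorous": "Describe the image in a humorous way:",
    "romantic": "Describe the image in a romantic way:",
}


def stylized_caption(captioner, image, style: str) -> str:
    """`captioner(image, prompt)` stands in for a prompt-tuned captioning model."""
    prompt = STYLE_PROMPTS[style]
    return captioner(image, prompt)


# Dummy captioner so the snippet runs on its own.
dummy = lambda image, prompt: f"[{prompt}] a dog catching a frisbee"
print(stylized_caption(dummy, image=None, style="humorous"))
```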
- AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation [61.77946020543875]
We propose a framework for translating raw descriptions with complex semantics into semantically corresponding images.
Our framework consists of two components: a projection module from Text Embeddings to Image Embeddings based on prompts, and an adapted image generation module built on StyleGAN.
Benefiting from the pre-trained models, our method can handle complex descriptions and does not require external paired data for training. A minimal sketch of a text-to-image-embedding projection follows this entry.
arXiv Detail & Related papers (2022-09-07T13:53:54Z)
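As a rough sketch of the projection module named in the AI Illustrator summary above, the snippet below maps a text embedding into an image-embedding space that an image generator (e.g., a StyleGAN latent) could consume. The MLP architecture and dimensions are assumptions for illustration.

```python
# Illustrative text-to-image-embedding projection: a small MLP maps a text
# embedding into the embedding space an image generator consumes (e.g. a
# StyleGAN latent). Dimensions and architecture are assumptions.
import torch
import torch.nn as nn


class TextToImageProjector(nn.Module):
    def __init__(self, d_text=512, d_image=512, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_text, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_image),
        )

    def forward(self, text_emb):          # (batch, d_text) -> (batch, d_image)
        return self.net(text_emb)


projector = TextToImageProjector()
latent = projector(torch.randn(1, 512))   # would be fed to the adapted generator
print(latent.shape)
```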
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. A toy sketch of such a style-token plus keyword input follows this entry.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
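Finally, a toy sketch of the style-token plus retrieved-keyword input mentioned in the last summary above: a style token selects the register while retrieved keywords carry the semantics. The token format and helper function are hypothetical, not the paper's design.

```python
# Toy construction of a decoder input that separates style from semantics:
# a style token selects the register, retrieved keywords carry the content.
# Token format and the keyword list are placeholders, not the paper's design.
def build_decoder_input(style: str, retrieved_keywords: list) -> str:
    style_token = f"<style:{style}>"
    keyword_span = " ".join(retrieved_keywords)
    return f"{style_token} <keywords> {keyword_span} </keywords>"


print(build_decoder_input("descriptive", ["dog", "frisbee", "park"]))
# -> <style:descriptive> <keywords> dog frisbee park </keywords>
```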