ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with
Unpaired Stylistic Corpora
- URL: http://arxiv.org/abs/2308.01143v1
- Date: Wed, 2 Aug 2023 13:33:20 GMT
- Title: ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with
Unpaired Stylistic Corpora
- Authors: Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun
Chen
- Abstract summary: We propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap).
A conditional variational auto-encoder is then used to automatically memorize diverse stylistic patterns in latent space.
Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance in terms of consistency with the image, style accuracy, and diversity.
- Score: 37.53634609063878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating visually grounded image captions with specific linguistic styles
using unpaired stylistic corpora is a challenging task, especially since we
expect stylized captions with a wide variety of stylistic patterns. In this
paper, we propose a novel framework to generate Accurate and Diverse Stylized
Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to
align the image and text features, which unifies paired factual and unpaired
stylistic corpora during the training process. A conditional variational
auto-encoder is then used to automatically memorize diverse stylistic patterns
in latent space and enhance diversity through sampling. We also design a simple
but effective recheck module to boost style accuracy by filtering
style-specific captions. Experimental results on two widely used stylized image
captioning datasets show that, in terms of consistency with the image, style
accuracy, and diversity, ADS-Cap achieves outstanding performance compared to
various baselines. We finally conduct extensive analyses to understand the
effectiveness of our method. Our code is available at
https://github.com/njucckevin/ADS-Cap.
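To make the inference-time flow described above more concrete, the following is a minimal, self-contained PyTorch sketch; it is not the released ADS-Cap implementation, and the toy decoder, the stand-in style checker, all dimensions, and the 0.5 acceptance threshold are illustrative assumptions.

```python
# A self-contained toy sketch of the inference flow described above (it
# is NOT the released ADS-Cap code): sample several latent style vectors
# from the CVAE prior, decode one candidate caption per sample, and keep
# only candidates that a style classifier accepts (the recheck step).
# All dimensions, the toy GRU decoder, and the 0.5 threshold are
# illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, IMG_DIM, Z_DIM, HID = 1000, 512, 64, 256

class ToyDecoder(nn.Module):
    """Greedy caption decoder conditioned on an image feature and a style latent."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.init_h = nn.Linear(IMG_DIM + Z_DIM, HID)
        self.cell = nn.GRUCell(HID, HID)
        self.proj = nn.Linear(HID, VOCAB)

    def forward(self, img_feat, z, max_len=12):
        h = torch.tanh(self.init_h(torch.cat([img_feat, z], dim=-1)))  # (1, HID)
        tok, caption = torch.zeros(1, dtype=torch.long), []  # assume <bos> = id 0
        for _ in range(max_len):
            h = self.cell(self.embed(tok), h)
            tok = self.proj(h).argmax(dim=-1)
            caption.append(tok.item())
        return caption

class ToyStyleChecker(nn.Module):
    """Stand-in for the recheck module: scores how 'stylized' a caption is."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.score = nn.Linear(HID, 1)

    def forward(self, caption):
        emb = self.embed(torch.tensor(caption)).mean(dim=0)
        return torch.sigmoid(self.score(emb)).item()

decoder, checker = ToyDecoder(), ToyStyleChecker()
img_feat = torch.randn(1, IMG_DIM)       # feature from the aligned image encoder
kept = []
for _ in range(5):                       # diversity comes from prior sampling
    z = torch.randn(1, Z_DIM)            # z ~ N(0, I), the CVAE prior
    caption = decoder(img_feat, z)
    if checker(caption) > 0.5:           # recheck: filter style-inaccurate captions
        kept.append(caption)
print(f"kept {len(kept)} of 5 sampled candidate captions")
```

In the actual framework, the image feature would come from the contrastively aligned encoder, the prior and decoder would be the trained CVAE components, and the recheck module would be a trained style classifier rather than the random-weight stand-ins used here.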
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, the interaction between synthetic captions and the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- StyleCap: Automatic Speaking-Style Captioning from Speech Based on
Speech and Language Self-supervised Learning Models [17.945821635380614]
StyleCap is a method to generate natural language descriptions of speaking styles appearing in speech.
StyleCap is trained with paired data of speech and natural language descriptions.
arXiv Detail & Related papers (2023-11-28T04:49:17Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate complementary cross-modal and temporal information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Visual Captioning at Will: Describing Images and Videos Guided by a Few
Stylized Sentences [49.66987347397398]
Few-Shot Stylized Visual Captioning aims to generate captions in any desired style, using only a few examples as guidance during inference.
We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module.
arXiv Detail & Related papers (2023-07-31T04:26:01Z)
- Controllable Image Captioning via Prompting [9.935191668056463]
We show that a unified model can perform well across diverse domains and freely switch among multiple styles.
To be specific, we design a set of prompts to fine-tune the pre-trained image captioner.
In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts.
arXiv Detail & Related papers (2022-12-04T11:59:31Z)
- Learning Distinct and Representative Styles for Image Captioning [24.13549951795951]
We propose a Discrete Mode Learning (DML) paradigm for image captioning.
Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings".
In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet.
arXiv Detail & Related papers (2022-09-17T03:25:46Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
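As a rough illustration of this kind of reward (not the paper's implementation), the sketch below scores candidate captions by their CLIP image-text cosine similarity using an off-the-shelf Hugging Face checkpoint; the choice of checkpoint and the use of the raw similarity as the reward are assumptions.

```python
# Minimal sketch: CLIP image-text similarity as a per-caption reward.
# The checkpoint and the raw cosine-similarity reward are assumptions,
# not necessarily what the paper uses.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Return one cosine-similarity reward per candidate caption."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)  # shape: (num_captions,)
```

Such a score can then stand in for (or complement) CIDEr as the reward signal when fine-tuning a captioner with reinforcement learning.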
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Diverse Image Captioning with Grounded Style [19.434931809979282]
We propose COCO-based augmentations to obtain varied stylized captions from COCO annotations.
We encode the stylized information in the latent space of a Variational Autoencoder.
Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions.
arXiv Detail & Related papers (2022-05-03T22:57:59Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
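For readers unfamiliar with this style of adapter tuning, here is a minimal PyTorch sketch of a feature adapter in the spirit of the CLIP-Adapter entry above: a small bottleneck MLP applied to a frozen CLIP feature and blended back in through a residual ratio. The layer sizes and the 0.2 ratio are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a CLIP-Adapter-style feature adapter: a bottleneck
# MLP over a frozen CLIP feature, mixed with the original feature via a
# residual ratio. Dimensions and ratio are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, clip_feat: torch.Tensor) -> torch.Tensor:
        adapted = self.net(clip_feat)
        # Keep most of the frozen CLIP feature; only the adapter is trained.
        return self.ratio * adapted + (1.0 - self.ratio) * clip_feat

# Example: adapt a batch of (frozen) CLIP image features.
feats = torch.randn(8, 512)
print(FeatureAdapter()(feats).shape)  # torch.Size([8, 512])
```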
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.