3M: Multi-style image caption generation using Multi-modality features under Multi-UPDOWN model
- URL: http://arxiv.org/abs/2103.11186v1
- Date: Sat, 20 Mar 2021 14:12:13 GMT
- Title: 3M: Multi-style image caption generation using Multi-modality features under Multi-UPDOWN model
- Authors: Chengxi Li and Brent Harrison
- Abstract summary: We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them to captions.
We demonstrate the effectiveness of our model on generating human-like captions by examining its performance on two datasets.
- Score: 8.069209836624495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we build a multi-style generative model for stylish image captioning which uses multi-modality image features: ResNeXt features and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model at generating human-like captions by examining its performance on two datasets, the PERSONALITY-CAPTIONS dataset and the FlickrStyle10K dataset. We compare against a variety of state-of-the-art baselines on automatic NLP metrics such as BLEU, ROUGE-L, CIDEr, and SPICE. A qualitative study has also been conducted to verify that our 3M model can be used to generate different stylized captions.
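To make the encode-then-decode idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of a two-stream, UPDOWN-style captioner that attends over visual features (e.g. ResNeXt region features) and text features (e.g. DenseCap phrase embeddings). All module names, dimensions, and the omission of style conditioning are assumptions for illustration, not the authors' released 3M implementation.
```python
import torch
import torch.nn as nn


class MultiModalityCaptioner(nn.Module):
    """Hypothetical two-stream UPDOWN-style captioner: an attention LSTM and a
    language LSTM attend over projected visual features and text features.
    Dimensions and structure are illustrative assumptions only."""

    def __init__(self, vis_dim=2048, txt_dim=512, hidden=512, vocab=10000):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.embed = nn.Embedding(vocab, hidden)
        # Attention LSTM sees the global context of both modalities plus the word.
        self.attn_lstm = nn.LSTMCell(2 * hidden + hidden, hidden)
        # Language LSTM sees the attended context from both modalities.
        self.lang_lstm = nn.LSTMCell(2 * hidden, hidden)
        self.att_vis = nn.Linear(hidden, 1)
        self.att_txt = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, vis_feats, txt_feats, tokens):
        # vis_feats: (B, Nv, vis_dim), txt_feats: (B, Nt, txt_dim), tokens: (B, T)
        v = self.vis_proj(vis_feats)
        t = self.txt_proj(txt_feats)
        B, hidden = v.size(0), v.size(-1)
        h1 = c1 = h2 = c2 = v.new_zeros(B, hidden)
        global_ctx = torch.cat([v.mean(1), t.mean(1)], dim=-1)
        logits = []
        for step in range(tokens.size(1)):
            word = self.embed(tokens[:, step])
            h1, c1 = self.attn_lstm(torch.cat([global_ctx, word], dim=-1), (h1, c1))
            # Simple additive attention over each modality, conditioned on h1.
            a_v = torch.softmax(self.att_vis(torch.tanh(v + h1.unsqueeze(1))), dim=1)
            a_t = torch.softmax(self.att_txt(torch.tanh(t + h1.unsqueeze(1))), dim=1)
            ctx = torch.cat([(a_v * v).sum(1), (a_t * t).sum(1)], dim=-1)
            h2, c2 = self.lang_lstm(ctx, (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)  # (B, T, vocab)


# Usage on random features: 36 ResNeXt-style regions and 10 DenseCap-style phrases.
model = MultiModalityCaptioner()
scores = model(torch.randn(2, 36, 2048), torch.randn(2, 10, 512),
               torch.randint(0, 10000, (2, 12)))
print(scores.shape)  # torch.Size([2, 12, 10000])
```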
Related papers
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, the interaction between synthetic captions and the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z)
- PixelBytes: Catching Unified Embedding for Multimodal Generation [0.0]
PixelBytes Embedding is a novel approach for unified multimodal representation learning.
Inspired by state-of-the-art sequence models such as Image Transformers, PixelCNN, and Mamba-Bytes, PixelBytes aims to address the challenges of integrating different data types.
arXiv Detail & Related papers (2024-09-03T06:02:02Z)
- mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models [71.40705814904898]
We introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding.
Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space.
arXiv Detail & Related papers (2024-08-09T03:25:42Z)
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions.
We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images.
We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- M-VADER: A Model for Diffusion with Multimodal Context [0.786460153386845]
We show how M-VADER enables the generation of images specified using combinations of image and text.
We introduce an embedding model closely related to a vision-language model.
arXiv Detail & Related papers (2022-12-06T12:45:21Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning performs text-image matching by mapping the image and text into a common embedding space (a generic sketch of this kind of common-embedding matching follows this list).
Instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
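As referenced in the TediGAN entry above, the following is a generic sketch of learning a common embedding space for text-image matching: both modalities are projected into a shared space and trained with a symmetric contrastive objective. The encoders, dimensions, and loss here are stand-ins for illustration, not TediGAN's actual modules.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonEmbeddingMatcher(nn.Module):
    """Generic visual-linguistic similarity sketch: project image and text
    features into a shared space and score matches by cosine similarity."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_head = nn.Linear(img_dim, embed_dim)
        self.txt_head = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feats, txt_feats):
        # Normalize both projections so the dot product is a cosine similarity.
        img = F.normalize(self.img_head(img_feats), dim=-1)
        txt = F.normalize(self.txt_head(txt_feats), dim=-1)
        return img @ txt.t()  # (B_img, B_txt) similarity matrix


def matching_loss(sim):
    # Symmetric contrastive objective: matched image/text pairs lie on the diagonal.
    targets = torch.arange(sim.size(0))
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2


# Usage on a batch of 4 paired (random) image/text feature vectors.
matcher = CommonEmbeddingMatcher()
loss = matching_loss(matcher(torch.randn(4, 2048), torch.randn(4, 768)))
print(loss.item())
```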