Image Captioning with Multi-Context Synthetic Data
- URL: http://arxiv.org/abs/2305.18072v2
- Date: Tue, 19 Dec 2023 14:17:57 GMT
- Title: Image Captioning with Multi-Context Synthetic Data
- Authors: Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun
- Abstract summary: Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
- Score: 16.961112970612447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning requires numerous annotated image-text pairs, resulting in
substantial annotation costs. Recently, large models (e.g. diffusion models and
large language models) have excelled in producing high-quality images and text.
This potential can be harnessed to create synthetic image-text pairs for
training captioning models. Synthetic data can improve cost and time efficiency
in data collection, allow for customization to specific domains, bootstrap
generalization capability for zero-shot performance, and circumvent privacy
concerns associated with real-world data. However, existing methods struggle to
attain satisfactory performance solely through synthetic data. We trace the
issue to images generated from simple descriptions: they mostly capture a
solitary perspective with limited context, failing to align with the intricate scenes
prevalent in real-world imagery. To tackle this, we present an innovative
pipeline that introduces multi-context data generation. Beginning with an
initial text corpus, our approach employs a large language model to extract
multiple sentences portraying the same scene from diverse viewpoints. These
sentences are then condensed into a single sentence with multiple contexts.
Subsequently, we generate intricate images using the condensed captions through
diffusion models. Our model is exclusively trained on synthetic image-text
pairs crafted through this process. The effectiveness of our pipeline is
validated through experimental results in both the in-domain and cross-domain
settings, where it achieves state-of-the-art performance on well-known datasets
such as MSCOCO, Flickr30k, and NoCaps.
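As a rough illustration only (not the authors' released code), the sketch below shows what the multi-context generation loop described in the abstract might look like: a placeholder llm_complete function stands in for any instruction-following large language model, and a Stable Diffusion pipeline from the diffusers library renders the condensed caption. All prompts and helper names are assumptions made for illustration.
```python
# Minimal sketch of multi-context synthetic data generation (illustrative only).
# Assumes an instruction-following LLM behind `llm_complete` and the `diffusers`
# library for text-to-image generation; prompt wording is a guess, not the paper's.
import torch
from diffusers import StableDiffusionPipeline


def llm_complete(prompt: str) -> str:
    """Placeholder: call any instruction-following LLM and return its text output."""
    raise NotImplementedError("plug in your LLM client here")


def extract_viewpoints(paragraph: str, n: int = 4) -> list[str]:
    # Step 1: ask the LLM for several sentences describing the same scene
    # from different viewpoints, one sentence per line.
    prompt = (f"Rewrite the scene below as {n} short sentences, each describing it "
              f"from a different viewpoint, one sentence per line:\n{paragraph}")
    return [s.strip() for s in llm_complete(prompt).splitlines() if s.strip()]


def condense(sentences: list[str]) -> str:
    # Step 2: merge the viewpoint sentences into one multi-context caption.
    prompt = ("Combine the following sentences into a single fluent sentence that "
              "preserves every piece of context:\n" + "\n".join(sentences))
    return llm_complete(prompt).strip()


def synthesize_pairs(corpus, model_id="runwayml/stable-diffusion-v1-5"):
    # Step 3: render each condensed caption with a diffusion model and pair it
    # with that caption, yielding synthetic training data for a captioner.
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16).to("cuda")
    pairs = []
    for paragraph in corpus:
        caption = condense(extract_viewpoints(paragraph))
        image = pipe(caption).images[0]
        pairs.append((image, caption))
    return pairs
```
The resulting (image, caption) pairs would then serve as the only supervision for the captioning model, matching the training setup the abstract describes.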
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on an additional reference image that provides visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve results superior to existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
- Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling [81.69474860607542]
We present Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text.
We also present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided.
arXiv Detail & Related papers (2024-08-07T11:20:37Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, an innovative training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework.
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts.
We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data.
Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)