A Picture is Worth a Thousand Words: Principled Recaptioning Improves
Image Generation
- URL: http://arxiv.org/abs/2310.16656v1
- Date: Wed, 25 Oct 2023 14:10:08 GMT
- Title: A Picture is Worth a Thousand Words: Principled Recaptioning Improves
Image Generation
- Authors: Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv
Leviathan
- Abstract summary: We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.
We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example.
- Score: 9.552642210681489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models achieved a remarkable leap in capabilities
over the last few years, enabling high-quality and diverse synthesis of images
from a textual prompt. However, even the most advanced models often struggle to
precisely follow all of the directions in their prompts. The vast majority of
these models are trained on datasets consisting of (image, caption) pairs where
the images often come from the web, and the captions are their HTML alternate
text. A notable example is the LAION dataset, used by Stable Diffusion and
other models. In this work we observe that these captions are often of low
quality, and argue that this significantly affects the model's capability to
understand nuanced semantics in the textual prompts. We show that by relabeling
the corpus with a specialized automatic captioning model and training a
text-to-image model on the recaptioned dataset, the model benefits
substantially across the board. First, in overall image quality: e.g. FID 14.84
vs. the baseline of 17.87, and 64.3% improvement in faithful image generation
according to human evaluation. Second, in semantic alignment, e.g. semantic
object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and
positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the
corpus and provide evidence that this technique, which we call RECAP, both
reduces the train-inference discrepancy and provides the model with more
information per example, increasing sample efficiency and allowing the model to
better understand the relations between captions and images.
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z) - Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z) - Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
Captioning [13.357749288588039]
Previous works leverage the CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z) - Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data [31.507451966555383]
We present a novel algorithm that incorporates human knowledge on image-text alignment to guide filtering vast corpus of web-crawled image-text datasets.
We collect a diverse image-text dataset where each image is associated with multiple captions from various sources.
We train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment.
arXiv Detail & Related papers (2023-12-11T05:57:09Z) - Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z) - Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet
Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand $textithypernymy$, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z) - Guiding Image Captioning Models Toward More Specific Captions [32.36062034676917]
We show that it is possible to generate more specific captions with minimal changes to the training process.
We implement classifier-free guidance for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions.
arXiv Detail & Related papers (2023-07-31T14:00:12Z) - RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models [36.19590638188108]
We create new variants of texts and images in the MS-COCO test set and re-evaluate the state-of-the-art (SOTA) models with the new data.
Specifically, we alter the meaning of text by replacing a word, and generate visually altered images that maintain some visual context.
Our evaluations on the proposed benchmark reveal substantial performance degradation in many SOTA models.
arXiv Detail & Related papers (2023-04-21T03:45:59Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.