Related papers: Zero-Shot Text-to-Image Generation

URL: http://arxiv.org/abs/2102.12092v1
Date: Wed, 24 Feb 2021 06:42:31 GMT
Title: Zero-Shot Text-to-Image Generation
Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
Abstract summary: We describe a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
Score: 15.135825501365007
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
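
The "single stream" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy token ids, and the choice to shift image token ids past the text vocabulary are all assumptions made for illustration. It only shows how two token sequences could be joined into one sequence and turned into next-token prediction targets.

```python
from typing import List, Tuple


def build_stream(text_tokens: List[int], image_tokens: List[int],
                 text_vocab_size: int) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Assumption (not from the abstract): image token ids are shifted by
    text_vocab_size so both kinds of token share one vocabulary
    without colliding.
    """
    if any(t < 0 or t >= text_vocab_size for t in text_tokens):
        raise ValueError("text token id out of range")
    if any(t < 0 for t in image_tokens):
        raise ValueError("image token id must be non-negative")
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Return (prefix, target) pairs: each token is predicted from all
    tokens that precede it, which is what 'autoregressive' means here."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


# Hypothetical toy example: 3 text tokens, 2 image tokens.
stream = build_stream([5, 2, 9], [0, 3], text_vocab_size=10)
pairs = next_token_pairs(stream)
```

Because the text tokens come first in the stream, every image token in this sketch is predicted with the full caption in its prefix, which is how a text prompt would condition the generated image.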

Related papers

Low-Biased General Annotated Dataset Generation [62.04202037186855]
We present a low-biased general annotated dataset generation framework (lbGen). Instead of expensive manual collection, we aim to directly generate low-biased images with category annotations. Experimental results confirm that, compared with manually labeled datasets or other synthetic datasets, using our generated low-biased dataset yields a stable improvement in generalization capacity.
arXiv Detail & Related papers (2024-12-14T13:28:40Z)
Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. We identify model weaknesses by testing the model using the counterfactual image dataset. We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
Adapt Anything: Tailor Any Image Classifiers across Domains And Categories Using Text-to-Image Diffusion Models [82.95591765009105]
We aim to study if a modern text-to-image diffusion model can tailor any task-adaptive image classifier across domains and categories. We utilize only one off-the-shelf text-to-image model to synthesize images with category labels derived from the corresponding text prompts.
arXiv Detail & Related papers (2023-10-25T11:58:14Z)
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
Evaluating Data Attribution for Text-to-Image Models [62.844382063780365]
We evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style. Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction. By taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.
arXiv Detail & Related papers (2023-06-15T17:59:51Z)
ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms. We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance. Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations. In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.