Class-Conditional self-reward mechanism for improved Text-to-Image models
- URL: http://arxiv.org/abs/2405.13473v2
- Date: Sat, 25 May 2024 07:05:51 GMT
- Title: Class-Conditional self-reward mechanism for improved Text-to-Image models
- Authors: Safouane El Ghazouali, Arnaud Gucciardi, Umberto Michelucci
- Abstract summary: We build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models.
This approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset.
The fine-tuned model has been evaluated to perform at least 60% better than existing commercial and research text-to-image models.
- Score: 1.8434042562191815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-rewarding has recently emerged as a powerful tool in the field of Natural Language Processing (NLP), allowing language models to generate high-quality, relevant responses by providing their own rewards during training. This innovative technique addresses the limitations of other methods that rely on human preferences. In this paper, we build upon the concept of self-rewarding models and introduce its vision equivalent for Text-to-Image generative AI models. This approach works by fine-tuning a diffusion model on a self-generated, self-judged dataset, making the fine-tuning more automated and improving data quality. The proposed mechanism makes use of other pre-trained models, such as vocabulary-based object detection and image captioning, and is conditioned on a set of object classes for which the user wants to improve the quality of generated data. The approach has been implemented, fine-tuned and evaluated on Stable Diffusion, leading to a performance evaluated to be at least 60% better than existing commercial and research text-to-image models. Additionally, the self-rewarding mechanism enables fully automated image generation, while increasing the visual quality of the generated images and improving adherence to prompt instructions. The code used in this work is freely available at https://github.com/safouaneelg/SRT2I.
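As a concrete illustration of the loop described in the abstract, the sketch below generates candidate images for a class-conditional prompt, self-judges them with an off-the-shelf captioner (BLIP) and CLIP image-text similarity, and keeps only the pairs that pass the check; the kept pairs would then be used to fine-tune the diffusion model. The model checkpoints, the use of CLIP as the judge, and the 0.30 threshold are illustrative assumptions rather than the authors' exact choices (the paper's judge uses vocabulary-based object detection and captioning; see https://github.com/safouaneelg/SRT2I for the released code).

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

device = "cuda" if torch.cuda.is_available() else "cpu"
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

@torch.no_grad()
def self_rewarded_pairs(prompt, target_class, n_candidates=4, threshold=0.30):
    """Generate candidates for one class-conditional prompt and keep only self-approved pairs."""
    kept = []
    for _ in range(n_candidates):
        image = sd(prompt).images[0]
        # Self-judging, step 1: caption the image and require the target class to be mentioned.
        cap_in = blip_proc(images=image, return_tensors="pt").to(device)
        caption = blip_proc.decode(blip.generate(**cap_in, max_new_tokens=30)[0],
                                   skip_special_tokens=True)
        # Self-judging, step 2: CLIP cosine similarity between the image and the original prompt.
        clip_in = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True).to(device)
        img_f = clip.get_image_features(pixel_values=clip_in["pixel_values"])
        txt_f = clip.get_text_features(input_ids=clip_in["input_ids"],
                                       attention_mask=clip_in["attention_mask"])
        score = torch.nn.functional.cosine_similarity(img_f, txt_f).item()
        if target_class.lower() in caption.lower() and score >= threshold:
            kept.append({"image": image, "text": caption, "prompt": prompt, "score": score})
    return kept  # these pairs form the self-generated, self-judged fine-tuning dataset
```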
Related papers
- Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution [43.07899102255169]
We propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers.
First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content.
Second, we introduce an answer self-enhancement technique, starting with image captioning to improve answer quality.
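A rough sketch of these two steps is shown below; `vlm_generate(image, prompt) -> str` is a hypothetical stand-in for the multimodal model being improved, and the real framework's prompting, filtering, and iteration are considerably more elaborate.

```python
def self_evolve_example(image, vlm_generate):
    """One self-evolution round; `vlm_generate(image, prompt) -> str` is the model being improved."""
    # Step 1: image-driven self-questioning -- write a question grounded in the image,
    # then let the model itself check that the question is answerable from the image.
    question = vlm_generate(image, "Ask one question that can only be answered by looking at this image.")
    check = vlm_generate(image, f"Can the question '{question}' be answered from this image? Reply yes or no.")
    if not check.strip().lower().startswith("yes"):
        return None  # discard questions the model cannot ground in the image
    # Step 2: answer self-enhancement -- caption first, then answer conditioned on the caption.
    caption = vlm_generate(image, "Describe this image in detail.")
    answer = vlm_generate(image, f"Image description: {caption}\nQuestion: {question}\nAnswer:")
    return {"image": image, "question": question, "answer": answer}
```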
arXiv Detail & Related papers (2024-12-20T08:06:00Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
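One minimal reading of sharpening is best-of-n self-verification: the model samples several responses and keeps the one it scores highest itself, which can then serve as an SFT target. The sketch below uses the model's mean token log-likelihood as the self-reward, which is only one illustrative choice (and gpt2 is a placeholder checkpoint); the paper's verifier-style rewards can be richer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def sharpen_best_of_n(prompt, n=4, max_new_tokens=64):
    """Sample n responses and keep the one the model itself scores highest."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    best_text, best_score = None, float("-inf")
    for _ in range(n):
        out = lm.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens,
                          pad_token_id=tok.eos_token_id)
        # Self-reward proxy: mean log-probability the model assigns to its own response tokens.
        logits = lm(out).logits[:, :-1]
        logprobs = torch.log_softmax(logits, dim=-1)
        token_lp = logprobs.gather(-1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
        score = token_lp[:, prompt_ids.shape[1] - 1:].mean().item()
        if score > best_score:
            best_score = score
            best_text = tok.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)
    return best_text  # would become the SFT target for this prompt
```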
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Diffusion Self-Distillation for Zero-Shot Customized Image Generation [40.11194010431839]
Diffusion Self-Distillation is a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks.
We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images.
We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset.
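A small sketch of the dataset-creation step: prompt the text-to-image model for a 2x2 grid showing the same subject, slice the panels apart, and pair them up as (reference, target, prompt) examples for the later text+image-to-image fine-tuning. The grid-prompt wording and the curation step (the paper filters inconsistent grids) are simplified away here.

```python
from itertools import permutations
from PIL import Image

def grid_to_pairs(grid: Image.Image, prompt: str):
    """Slice a 2x2 grid image into panels and build paired training examples."""
    w, h = grid.size
    panels = [grid.crop((c * w // 2, r * h // 2, (c + 1) * w // 2, (r + 1) * h // 2))
              for r in range(2) for c in range(2)]
    # Every ordered pair of distinct panels becomes one (reference, target, prompt) example.
    return [{"reference": a, "target": b, "prompt": prompt} for a, b in permutations(panels, 2)]
```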
arXiv Detail & Related papers (2024-11-27T18:58:52Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
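A hedged sketch of the preference-construction idea: the preferred description comes from the model looking at the clean image with a careful prompt, while the dispreferred one comes from a corrupted view or a hallucination-inviting prompt. `vlm_describe` is a hypothetical stand-in for the vision-language model being trained, and the resulting triples would feed a DPO-style preference-tuning step; this is an illustration of the idea, not STIC's exact recipe.

```python
import random
from PIL import Image, ImageFilter

def build_preference_example(image: Image.Image, vlm_describe):
    """Return a {prompt, chosen, rejected} triple built without any human labels."""
    chosen = vlm_describe(image, "Describe this image accurately and in detail.")
    if random.random() < 0.5:
        # Dispreferred via a bad view: describe a heavily blurred copy of the image.
        rejected = vlm_describe(image.filter(ImageFilter.GaussianBlur(radius=8)),
                                "Describe this image accurately and in detail.")
    else:
        # Dispreferred via a bad prompt: invite hallucinated details.
        rejected = vlm_describe(image, "Describe this image, freely adding plausible details.")
    return {"prompt": "Describe this image.", "chosen": chosen, "rejected": rejected}
```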
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- SelfEval: Leveraging the discriminative nature of generative models for evaluation [30.239717220862143]
We present an automated way to evaluate the text alignment of text-to-image generative diffusion models.
Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
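A rough sketch of the idea, assuming a Stable Diffusion checkpoint: score a real image under each candidate prompt by how well the model denoises it when conditioned on that prompt (a proxy for the likelihood of the image given the text), and check whether the true prompt scores best. Timestep sampling and averaging are simplified relative to the paper.

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

@torch.no_grad()
def prompt_score(image_tensor, prompt, n_steps=8):
    """image_tensor: [1, 3, 512, 512] in [-1, 1]; higher return value = better text-image fit."""
    latents = pipe.vae.encode(image_tensor.to(device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    text_ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                              max_length=pipe.tokenizer.model_max_length,
                              return_tensors="pt").input_ids.to(device)
    text_emb = pipe.text_encoder(text_ids)[0]
    errors = []
    for t in torch.linspace(50, 950, n_steps).long():
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t.unsqueeze(0).to(device))
        pred = pipe.unet(noisy, t.to(device), encoder_hidden_states=text_emb).sample
        errors.append(torch.mean((pred - noise) ** 2).item())
    return -sum(errors) / len(errors)  # negative mean denoising error as a likelihood proxy

def classify_prompt(image_tensor, candidate_prompts):
    """Pick the caption the diffusion model itself judges most likely for the image."""
    return max(candidate_prompts, key=lambda p: prompt_score(image_tensor, p))
```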
arXiv Detail & Related papers (2023-11-17T18:58:16Z)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
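Quality-tuning, as summarized above, amounts to continuing training on a very small, carefully curated set of exceptionally appealing image-text pairs. The sketch below shows only one possible curation step, with a hypothetical `aesthetic_score(image)` predictor standing in for the paper's manual, photography-grade selection; the retained pairs would then go through ordinary supervised fine-tuning.

```python
def curate_for_quality_tuning(pairs, aesthetic_score, keep=2000):
    """pairs: iterable of (image, caption); return the `keep` most aesthetic pairs."""
    scored = sorted(pairs, key=lambda p: aesthetic_score(p[0]), reverse=True)
    return scored[:keep]
```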
arXiv Detail & Related papers (2023-09-27T17:30:19Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
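Because CLIP image and text embeddings live in a shared space, the missing caption embedding can be replaced by a perturbed, re-normalized image embedding during training. The sketch below illustrates that substitution with the openai/clip-vit-base-patch32 checkpoint; the fixed noise_scale is an illustrative simplification, not LAFITE's exact perturbation scheme.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def pseudo_text_feature(image, noise_scale: float = 0.1) -> torch.Tensor:
    """Build a caption-free conditioning vector from the image's CLIP embedding."""
    inputs = proc(images=image, return_tensors="pt")
    feat = clip.get_image_features(**inputs)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    # Perturb within the joint space so the generator does not overfit to exact
    # image embeddings, then re-normalize to stay on the unit sphere.
    noisy = feat + noise_scale * torch.randn_like(feat)
    return noisy / noisy.norm(dim=-1, keepdim=True)
```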
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- Text-to-Image Generation with Attention Based Recurrent Neural Networks [1.2599533416395765]
We develop a tractable and stable caption-based image generation model.
Experiments are performed on Microsoft datasets.
Results show that the proposed model performs better than contemporary approaches.
arXiv Detail & Related papers (2020-01-18T12:19:19Z)