Aligning Text-to-Image Models using Human Feedback
- URL: http://arxiv.org/abs/2302.12192v1
- Date: Thu, 23 Feb 2023 17:34:53 GMT
- Title: Aligning Text-to-Image Models using Human Feedback
- Authors: Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig
Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Shixiang Shane Gu
- Abstract summary: Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
- Score: 104.76638092169604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep generative models have shown impressive results in text-to-image
synthesis. However, current text-to-image models often generate images that are
inadequately aligned with text prompts. We propose a fine-tuning method for
aligning such models using human feedback, comprising three stages. First, we
collect human feedback assessing model output alignment from a set of diverse
text prompts. We then use the human-labeled image-text dataset to train a
reward function that predicts human feedback. Lastly, the text-to-image model
is fine-tuned by maximizing reward-weighted likelihood to improve image-text
alignment. Our method generates objects with specified colors, counts and
backgrounds more accurately than the pre-trained model. We also analyze several
design choices and find that careful investigation of these choices is important
in balancing the alignment-fidelity tradeoff. Our results
demonstrate the potential for learning from human feedback to significantly
improve text-to-image models.
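As a concrete illustration of stages two and three above, the sketch below (PyTorch) trains a reward model on human-labeled image-text pairs and then fine-tunes a generator by maximizing reward-weighted likelihood. It is a minimal sketch, not the authors' implementation: the toy reward network, the feature inputs, and the generator's `log_likelihood` hook are hypothetical stand-ins, and for a diffusion model the likelihood term would in practice be approximated by its denoising objective.

```python
# Minimal sketch of stages 2-3: (2) fit a reward model to human feedback,
# (3) fine-tune the generator with reward-weighted likelihood.
# All modules and data fields are toy stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Predicts a scalar alignment score from (image, text) feature vectors."""
    def __init__(self, img_dim=512, txt_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img_feat, txt_feat):
        return self.mlp(torch.cat([img_feat, txt_feat], dim=-1)).squeeze(-1)

def train_reward_model(reward_model, loader, epochs=1, lr=1e-4):
    """Stage 2: supervised training on binary human alignment labels."""
    opt = torch.optim.Adam(reward_model.parameters(), lr=lr)
    for _ in range(epochs):
        for img_feat, txt_feat, label in loader:  # label in {0, 1}
            logit = reward_model(img_feat, txt_feat)
            loss = F.binary_cross_entropy_with_logits(logit, label.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reward_model

def reward_weighted_finetune(gen_model, reward_model, loader, lr=1e-5):
    """Stage 3: maximize reward-weighted log-likelihood of image-text pairs.

    `gen_model.log_likelihood(images, prompts)` is a hypothetical hook; for a
    diffusion model it would be replaced by the (negative) denoising loss
    used as a likelihood surrogate.
    """
    opt = torch.optim.Adam(gen_model.parameters(), lr=lr)
    for images, prompts, img_feat, txt_feat in loader:
        with torch.no_grad():
            w = torch.sigmoid(reward_model(img_feat, txt_feat))  # per-example weight
        loss = -(w * gen_model.log_likelihood(images, prompts)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen_model
```

Weighting each example's likelihood by its predicted reward is what biases the fine-tuned model toward outputs that humans judged well aligned.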
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z) - Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z) - Leveraging Human Revisions for Improving Text-to-Layout Models [16.617352120973806]
We propose using nuanced feedback through the form of human revisions for stronger alignment.
Our method, Revision-Aware Reward Models, allows a generative text-to-layout model to produce more modern, designer-aligned layouts.
arXiv Detail & Related papers (2024-05-16T01:33:09Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts (a hypothetical sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2024-04-02T11:40:38Z) - Rich Human Feedback for Text-to-Image Generation [27.030777546301376]
We collect rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically.
We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models.
arXiv Detail & Related papers (2023-12-15T22:18:38Z) - Human Learning by Model Feedback: The Dynamics of Iterative Prompting
with Midjourney [28.39697076030535]
This paper analyzes the dynamics of user prompts across such iterations.
We show that prompts predictably converge toward specific traits along these iterations.
The possibility that users adapt to the model's preference raises concerns about reusing user data for further training.
arXiv Detail & Related papers (2023-11-20T19:28:52Z) - ImageReward: Learning and Evaluating Human Preferences for Text-to-Image
Generation [30.977582244445742]
We build ImageReward, the first general-purpose text-to-image human preference reward model.
Its training is based on our systematic annotation pipeline including rating and ranking (a generic ranking-loss sketch follows the related-papers list).
In human evaluation, ImageReward outperforms existing scoring models and metrics.
arXiv Detail & Related papers (2023-04-12T16:58:13Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
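The TextNorm entry above describes calibrating a reward via semantically contrastive prompts. The snippet below is a hedged, hypothetical reading of that idea rather than the paper's exact formulation: a softmax over rewards for the intended prompt versus contrastive prompts serves as a confidence-normalized score. `reward_fn`, the prompt set, and the temperature are all assumptions.

```python
# Hedged sketch: normalize a text-to-image reward by comparing the intended
# prompt against semantically contrastive prompts; a softmax over the scores
# acts as a confidence measure. One plausible reading of the idea, not the
# TextNorm paper's exact formulation.
import torch

def calibrated_reward(reward_fn, image, prompt, contrastive_prompts, temperature=1.0):
    """reward_fn(image, prompt) -> scalar tensor is an assumed interface;
    contrastive_prompts contradict the target (e.g. "a red cube" vs "a blue cube")."""
    prompts = [prompt] + list(contrastive_prompts)
    scores = torch.stack([reward_fn(image, p) for p in prompts]) / temperature
    probs = torch.softmax(scores, dim=0)
    return probs[0]  # high only if the reward clearly prefers the intended prompt
```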
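The ImageReward entry mentions training from ratings and rankings; preference reward models of this kind are commonly fit with a pairwise, Bradley-Terry style ranking loss. The function below is a generic sketch of that standard loss, not a claim about ImageReward's exact objective; `reward_model` follows the same (image features, text features) scorer interface as the toy model sketched after the abstract.

```python
# Generic pairwise (Bradley-Terry style) ranking loss for a preference reward
# model: the human-preferred image should score higher than the rejected one
# for the same prompt. A standard recipe, not ImageReward's exact objective.
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, txt_feat, img_feat_preferred, img_feat_rejected):
    r_pos = reward_model(img_feat_preferred, txt_feat)
    r_neg = reward_model(img_feat_rejected, txt_feat)
    # -log sigmoid(r_pos - r_neg) is minimized when preferred images score higher.
    return -F.logsigmoid(r_pos - r_neg).mean()
```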
This list is automatically generated from the titles and abstracts of the papers on this site.