HIVE: Harnessing Human Feedback for Instructional Visual Editing
- URL: http://arxiv.org/abs/2303.09618v2
- Date: Tue, 26 Mar 2024 22:59:52 GMT
- Title: HIVE: Harnessing Human Feedback for Instructional Visual Editing
- Authors: Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, Ran Xu
- Abstract summary: We present a novel framework to harness human feedback for instructional visual editing (HIVE).
Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences.
We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward.
- Score: 127.29436858998064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incorporating human feedback has been shown to be crucial for aligning text generated by large language models with human preferences. We hypothesize that state-of-the-art instructional image editing models, whose outputs are generated from an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not adhere to users' instructions and preferences. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward. In addition, to mitigate bias arising from data limitations, we contribute a new 1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive quantitative and qualitative experiments, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.
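The first stage described in the abstract, collecting pairwise human feedback on edited images and fitting a reward function to it, is commonly modeled with a Bradley-Terry-style objective. The sketch below is an illustrative toy only, not HIVE's actual reward model: it fits a linear reward over made-up feature vectors with synthetic preference labels, just to show the shape of the learning problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_reward(features, pref_pairs, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w @ x from pairwise preferences using the
    Bradley-Terry model: P(i preferred over j) = sigmoid(r_i - r_j).
    Each pair (i, j) in pref_pairs means item i was preferred over item j."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for i, j in pref_pairs:
            diff = features[i] - features[j]
            p = 1.0 / (1.0 + np.exp(-(w @ diff)))  # P(i preferred | w)
            grad += (p - 1.0) * diff               # grad of -log sigmoid
        w -= lr * grad / len(pref_pairs)
    return w

# Toy data: a hidden "true" reward generates noiseless preferences.
# In HIVE the items would be (instruction, edited image) pairs scored
# by human annotators; here they are random 5-d feature vectors.
X = rng.normal(size=(20, 5))
true_w = rng.normal(size=5)
pairs = []
for _ in range(100):
    i, j = rng.choice(20, size=2, replace=False)
    pairs.append((i, j) if X[i] @ true_w >= X[j] @ true_w else (j, i))

w = fit_reward(X, pairs)
scores = X @ w
# The learned reward should rank the preferred item higher on most pairs.
acc = np.mean([scores[i] > scores[j] for i, j in pairs])
print(f"pairwise accuracy: {acc:.2f}")
```

In the full pipeline this estimated reward would then weight or guide diffusion model fine-tuning, as the abstract describes; the fitting step above is only the preference-learning piece.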
Related papers
- Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning [21.707688492630304]
HERO is an online training method that captures human feedback and provides informative learning signals for fine-tuning.
HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.
arXiv Detail & Related papers (2024-10-07T15:12:01Z)
- Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation [67.88747330066049]
Fine-grained feedback captures nuanced distinctions in image quality and prompt-alignment.
We show that its superiority over coarse-grained feedback is not automatic.
We identify key challenges in eliciting and utilizing fine-grained feedback.
arXiv Detail & Related papers (2024-06-24T17:19:34Z)
- Leveraging Human Revisions for Improving Text-to-Layout Models [16.617352120973806]
We propose using nuanced feedback through the form of human revisions for stronger alignment.
Our method, Revision-Aware Reward Models, allows a generative text-to-layout model to produce more modern, designer-aligned layouts.
arXiv Detail & Related papers (2024-05-16T01:33:09Z)
- Constructive Large Language Models Alignment with Diverse Feedback [76.9578950893839]
We introduce Constructive and Diverse Feedback (CDF) as a novel method to enhance the alignment of large language models.
We exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems.
By training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data.
arXiv Detail & Related papers (2023-10-10T09:20:14Z)
- Putting Humans in the Image Captioning Loop [8.584932159968002]
We present work-in-progress on adapting an image captioning (IC) system to integrate human feedback.
Our approach builds on a base IC model pre-trained on the MS COCO dataset, which generates captions for unseen images.
We hope that this approach, while leading to improved results, will also result in customizable IC models.
arXiv Detail & Related papers (2023-06-06T07:50:46Z)
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation [30.977582244445742]
We build ImageReward, the first general-purpose text-to-image human preference reward model.
Its training is based on our systematic annotation pipeline including rating and ranking.
In human evaluation, ImageReward outperforms existing scoring models and metrics.
arXiv Detail & Related papers (2023-04-12T16:58:13Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person re-identification, person attribute recognition, and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
- Aligning Text-to-Image Models using Human Feedback [104.76638092169604]
Current text-to-image models often generate images that are inadequately aligned with text prompts.
We propose a fine-tuning method for aligning such models using human feedback.
Our results demonstrate the potential for learning from human feedback to significantly improve text-to-image models.
arXiv Detail & Related papers (2023-02-23T17:34:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.