Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
- URL: http://arxiv.org/abs/2410.05116v1
- Date: Mon, 7 Oct 2024 15:12:01 GMT
- Title: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
- Authors: Ayano Hiranaka, Shang-Fu Chen, Chieh-Hsin Lai, Dongjun Kim, Naoki Murata, Takashi Shibuya, Wei-Hsiang Liao, Shao-Hua Sun, Yuki Mitsufuji
- Abstract summary: HERO is a framework that leverages online human feedback, collected on the fly during model learning, to provide informative learning signals for fine-tuning.
HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.
- Score: 21.707688492630304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD's refined initialization samples, enabling faster convergence towards the evaluator's intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.
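Read as an algorithm, the abstract describes an online loop: generate images, collect a small batch of human good/bad labels, train a feedback-aligned embedding so that the sparse labels become a dense scoring signal, and reseed generation from the best-scoring latents. The sketch below is a minimal, torch-only illustration of that loop under those assumptions; the names (FeedbackEncoder, generate_from_latents, human_label) and the simplified contrastive loss are hypothetical placeholders for illustration, not HERO's actual objectives or code.

```python
# Minimal sketch of an online human-feedback loop in the spirit of HERO.
# generate_from_latents and human_label stand in for Stable Diffusion sampling
# and the human evaluator; both are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackEncoder(nn.Module):
    """Embeds images so that human good/bad feedback becomes a dense score."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def generate_from_latents(latents):
    """Stand-in for diffusion sampling from (possibly refined) initial latents."""
    return torch.tanh(latents)

def human_label(images):
    """Stand-in for the online human evaluator (1 = good, 0 = bad)."""
    return (images.mean(dim=(1, 2, 3)) > 0).long()

encoder = FeedbackEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
latents = torch.randn(16, 3, 32, 32)  # initial generation seeds

for _ in range(5):  # each iteration = one round of online feedback
    images = generate_from_latents(latents)
    labels = human_label(images)
    emb = encoder(images)

    # (1) Feedback-aligned representation learning (simplified): pull "good"
    # embeddings toward their prototype and push "bad" embeddings away.
    good, bad = emb[labels == 1], emb[labels == 0]
    if len(good) > 0 and len(bad) > 0:
        proto = good.mean(dim=0, keepdim=True)
        loss = (1 - good @ proto.T).mean() + F.relu(bad @ proto.T).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # (2) Feedback-guided generation (simplified): score all samples with the
    # learned embedding and reuse the best latents, lightly perturbed, as the
    # refined initialization for the next round.
    with torch.no_grad():
        if (labels == 1).any():
            proto = encoder(images)[labels == 1].mean(dim=0, keepdim=True)
            scores = (encoder(images) @ proto.T).squeeze(-1)
        else:
            scores = torch.zeros(len(images))
        top = scores.topk(k=len(images) // 2).indices
        seeds = latents[top]
        latents = torch.cat([seeds, seeds], dim=0) + 0.1 * torch.randn(16, 3, 32, 32)
```

In the real setting, generate_from_latents would be a Stable Diffusion sampler and human_label would be the human evaluator supplying the roughly 0.5K total labels reported above.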
Related papers
- GUIDE: Real-Time Human-Shaped Agents [4.676987516944155]
We introduce GUIDE, a framework for real-time human-guided reinforcement learning.
With only 10 minutes of human feedback, our algorithm achieves up to a 30% increase in success rate compared to its RL baseline.
arXiv Detail & Related papers (2024-10-19T18:59:39Z)
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF [79.98542868281471]
Reinforcement Learning from Human Feedback (RLHF) is a technique that aligns language models closely with human-centric values.
It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective.
This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS).
arXiv Detail & Related papers (2024-01-29T17:43:42Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses (the pairwise reward-modeling loss that several of these methods build on is sketched after this list).
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- COURIER: Contrastive User Intention Reconstruction for Large-Scale Visual Recommendation [33.903096803803706]
We argue that a visual feature pre-training method tailored for recommendation is necessary for further improvements beyond existing modality features.
We propose an effective user intention reconstruction module to mine visual features related to user interests from behavior histories.
arXiv Detail & Related papers (2023-06-08T07:45:24Z)
- HIVE: Harnessing Human Feedback for Instructional Visual Editing [127.29436858998064]
We present a novel framework to harness human feedback for instructional visual editing (HIVE).
Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences.
We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward.
arXiv Detail & Related papers (2023-03-16T19:47:41Z)
- Learning New Skills after Deployment: Improving open-domain internet-driven dialogue with human feedback [22.92577324751342]
We study how to improve internet-driven conversational skills in a learning framework.
We collect deployment data along with various types of human feedback.
We find the recently introduced Director model shows significant improvements over other existing approaches.
arXiv Detail & Related papers (2022-08-05T16:41:46Z)
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training [94.87393610927812]
We present an off-policy, interactive reinforcement learning algorithm that capitalizes on the strengths of both feedback and off-policy learning.
We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods.
arXiv Detail & Related papers (2021-06-09T14:10:50Z)
- Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation [20.837228359591663]
We present the first study of using human visual explanations in human-in-the-loop reinforcement learning.
We propose EXPAND to encourage the model to encode task-relevant features through a context-aware data augmentation.
arXiv Detail & Related papers (2020-06-26T05:40:05Z)
- Facial Feedback for Reinforcement Learning: A Case Study and Offline Analysis Using the TAMER Framework [51.237191651923666]
We investigate the potential of agents learning from trainers' facial expressions by interpreting them as evaluative feedback.
With the designed CNN-RNN model, our analysis shows that telling trainers to use facial expressions and competition can improve the accuracy of estimating positive and negative feedback.
Our results with a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible.
arXiv Detail & Related papers (2020-01-23T17:50:57Z)
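Several of the related entries above (Iterative Data Smoothing, Secrets of RLHF Part II, HIVE, PEBBLE) start from the same pairwise reward-modeling objective: fit a reward model so that human-preferred ("chosen") samples score above "rejected" ones. The sketch below shows that standard Bradley-Terry style loss using toy feature vectors as stand-ins for real inputs; it is the common starting point, not any single paper's implementation, and the names (RewardModel, preference_loss) are placeholders.

```python
# Minimal sketch of pairwise (Bradley-Terry) reward-model training, the common
# objective several of the related papers above build on; not any paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an input; here a toy MLP over feature vectors (hypothetical)."""
    def __init__(self, in_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def preference_loss(model, chosen, rejected):
    # -log sigmoid(r(chosen) - r(rejected)): the reward model learns to rank
    # the human-preferred sample above the rejected one.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)  # stand-in features
for _ in range(100):
    loss = preference_loss(model, chosen, rejected)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The methods above then modify this baseline in different ways, e.g. Iterative Data Smoothing targets reward overfitting by smoothing the training signal across epochs, while Secrets of RLHF Part II adds contrastive terms to better separate chosen from rejected responses.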