RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
Fine-grained Correctional Human Feedback
- URL: http://arxiv.org/abs/2312.00849v2
- Date: Fri, 8 Mar 2024 06:42:37 GMT
- Title: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from
Fine-grained Correctional Human Feedback
- Authors: Tianyu Yu and Yuan Yao and Haoye Zhang and Taiwen He and Yifeng Han
and Ganqu Cui and Jinyi Hu and Zhiyuan Liu and Hai-Tao Zheng and Maosong Sun
and Tat-Seng Chua
- Abstract summary: We present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback.
Experiments on five benchmarks in both automatic and human evaluation show that, RLHF-V can enable substantially more trustworthy MLLM behaviors.
- Score: 103.08766858584049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated
impressive capabilities in multimodal understanding, reasoning, and
interaction. However, existing MLLMs prevalently suffer from serious
hallucination problems, generating text that is not factually grounded in
associated images. The problem makes existing MLLMs untrustworthy and thus
impractical in real-world (especially high-stakes) applications. To address the
challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior
alignment from fine-grained correctional human feedback. Specifically, RLHF-V
collects human preference in the form of segment-level corrections on
hallucinations, and performs dense direct preference optimization over the
human feedback. Comprehensive experiments on five benchmarks in both automatic
and human evaluation show that, RLHF-V can enable substantially more
trustworthy MLLM behaviors with promising data and computation efficiency.
Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the
hallucination rate of the base MLLM by 34.8%, outperforming the concurrent
LLaVA-RLHF trained on 10k annotated data. The final model achieves
state-of-the-art performance in trustworthiness among open-source MLLMs, and
shows better robustness than GPT-4V in preventing hallucinations aroused from
over-generalization. We open-source our code, model, and data at
https://github.com/RLHF-V/RLHF-V.
Related papers
- Language Models Learn to Mislead Humans via RLHF [100.95201965748343]
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex.
We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers.
Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
arXiv Detail & Related papers (2024-09-19T14:50:34Z) - Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns Large language models (LLMs) with human intentions and values.
Despite its effectiveness and popularity, RLHF is prone to biased local optimization.
We propose a novel textitsequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z) - Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others.
We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO.
Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z) - RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness.
RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm.
Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z) - RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z) - ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback [86.87638927637005]
ChatGLM is a free-to-use AI service powered by large language models (LLMs)
We present the ChatGLM-RLHF pipeline, designed to enhance ChatGLM's alignment with human preferences.
arXiv Detail & Related papers (2024-04-01T05:39:36Z) - Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMM) are built across modalities and misalignment between two modalities can result in "hallucination"
We adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment.
We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information.
Our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 94% performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.