Related papers: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

URL: http://arxiv.org/abs/2312.00849v2
Date: Fri, 8 Mar 2024 06:42:37 GMT
Title: RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Authors: Tianyu Yu and Yuan Yao and Haoye Zhang and Taiwen He and Yifeng Han and Ganqu Cui and Jinyi Hu and Zhiyuan Liu and Hai-Tao Zheng and Maosong Sun and Tat-Seng Chua
Abstract summary: We present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors.
Score: 103.08766858584049
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. However, existing MLLMs prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically, RLHF-V collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8%, outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs, and shows better robustness than GPT-4V in preventing hallucinations arising from over-generalization. We open-source our code, model, and data at https://github.com/RLHF-V/RLHF-V.
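
Illustrative sketch (not the authors' code): the abstract describes "dense direct preference optimization" over segment-level corrections but gives no formula. The Python below shows one plausible reading, assuming that tokens inside human-corrected segments are up-weighted by a factor gamma when scoring a response, and that the resulting scores feed a standard DPO loss. The function names, the normalization by total weight, and the values of gamma and beta are all assumptions made for illustration; consult the linked repository for the actual implementation.

```python
import math


def weighted_logprob(token_logps, corrected_mask, gamma=2.0):
    """Hypothetical helper: segment-weighted log-probability of a response.

    token_logps    -- per-token log-probabilities of the response
    corrected_mask -- True where the token lies in a human-corrected segment
    gamma          -- assumed weight (> 1) for corrected-segment tokens
    The weighted sum is normalized by the total weight (an assumption).
    """
    if len(token_logps) != len(corrected_mask):
        raise ValueError("token_logps and corrected_mask must have equal length")
    weights = [gamma if corrected else 1.0 for corrected in corrected_mask]
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_logps))
    return weighted_sum / sum(weights)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.5):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Each argument is a (possibly segment-weighted) sequence log-probability
    under the policy or the frozen reference model. beta is illustrative.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage: the corrected response differs from the original in its last token.
chosen = weighted_logprob([-0.2, -0.3, -0.1], [False, False, True])
rejected = weighted_logprob([-0.2, -0.3, -2.5], [False, False, True])
loss = dpo_loss(chosen, rejected, ref_chosen=-0.5, ref_rejected=-0.5)
```

Because only the corrected tokens differ between the two responses, up-weighting them concentrates the preference signal on the hallucinated span rather than spreading it over the whole response.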

Related papers

Generative RLHF-V: Learning Principles from Multi-modal Human Preference [15.068452240642884]
We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention, then predict the correct pair-wise scores. Our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while the baseline RLHF is only 5.3%.
arXiv Detail & Related papers (2025-05-24T05:50:07Z)
Linear Probe Penalties Reduce LLM Sycophancy [3.6490659260835234]
Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements. This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF). We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior.
arXiv Detail & Related papers (2024-12-01T21:11:28Z)
Continual SFT Matches Multimodal RLHF with Negative Supervision [32.784161582943874]
Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds its superiority over continual SFT during this preference alignment stage. We propose a novel negative supervised finetuning (nSFT) approach that fully exploits the information contained in negative supervision.
arXiv Detail & Related papers (2024-11-22T08:48:30Z)
Language Models Learn to Mislead Humans via RLHF [100.95201965748343]
Language models (LMs) can produce errors that are hard to detect for humans, especially when the task is complex. We study this phenomenon under a standard RLHF pipeline, calling it "U-SOPHISTRY" since it is Unintended by model developers. Our results highlight an important failure mode of RLHF and call for more research in assisting humans to align them.
arXiv Detail & Related papers (2024-09-19T14:50:34Z)
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback [8.601283886845664]
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) with human intentions and values. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. We propose a novel sequence-to-sequence (seq2seq) reward modeling method.
arXiv Detail & Related papers (2024-08-30T16:14:35Z)
Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. We propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z)
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. RLAIF-V maximally exploits open-source feedback from two perspectives: high-quality feedback data and an online feedback learning algorithm. Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z)
ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback [86.87638927637005]
ChatGLM is a free-to-use AI service powered by large language models (LLMs). We present the ChatGLM-RLHF pipeline, designed to enhance ChatGLM's alignment with human preferences.
arXiv Detail & Related papers (2024-04-01T05:39:36Z)
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models [84.78513908768011]
We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA). MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
arXiv Detail & Related papers (2024-03-05T14:31:24Z)
Aligning Large Multimodal Models with Factually Augmented RLHF [176.54751941088819]
Large Multimodal Models (LMMs) are built across modalities, and misalignment between two modalities can result in "hallucination". We adapt Reinforcement Learning from Human Feedback (RLHF) from the text domain to the task of vision-language alignment. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information. Our approach achieves remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4.
arXiv Detail & Related papers (2023-09-25T20:59:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.