Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
- URL: http://arxiv.org/abs/2506.08022v2
- Date: Wed, 11 Jun 2025 03:54:30 GMT
- Title: Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
- Authors: Chenxi Liu, Tianyi Xiong, Ruibo Chen, Yihan Wu, Junfeng Guo, Tianyi Zhou, Heng Huang
- Abstract summary: We propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases. It can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
- Score: 66.54211199959298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., language prior biases outweighing visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning capabilities, remains largely underexplored in LMM alignment. In this paper, we propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases due to limited usage of visual information, through adversarial perturbation of input images. Moreover, MBPO leverages the easy-to-verify nature of close-ended tasks to generate online responses with verified rewards. GRPO is then employed to train the model with offline-online hybrid data. Extensive experiments demonstrate that MBPO can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
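The abstract describes two data sources: offline hard negatives mined by adversarially perturbing the input image so the model falls back on its language priors, and online responses on close-ended tasks scored with verified rewards and optimized via GRPO. The sketch below illustrates both ideas under stated assumptions only: the `lmm.log_prob`/`lmm.generate` interface, the PGD objective (suppressing the likelihood of a reference answer), and all hyperparameters are hypothetical and are not the authors' implementation.

```python
import torch

def mine_hard_negative(lmm, image, prompt, reference_answer,
                       eps=8 / 255, step_size=2 / 255, num_steps=10):
    """PGD-style adversarial negative mining (illustrative sketch only).

    Assumed interface (not the authors' API):
      lmm.log_prob(image, prompt, answer) -> differentiable scalar log-likelihood
      lmm.generate(image, prompt)         -> decoded response string
    """
    perturbed = image.clone().detach()
    for _ in range(num_steps):
        perturbed.requires_grad_(True)
        # Lower the likelihood of the visually grounded reference answer so the
        # model's response is driven by LLM priors instead of the image.
        loss = lmm.log_prob(perturbed, prompt, reference_answer)
        grad, = torch.autograd.grad(loss, perturbed)
        with torch.no_grad():
            perturbed = perturbed - step_size * grad.sign()
            perturbed = image + (perturbed - image).clamp(-eps, eps)  # stay in the L_inf ball
            perturbed = perturbed.clamp(0.0, 1.0)
    # The response from the perturbed image serves as the rejected (hard negative);
    # the response from the clean image (or a verified answer) serves as the chosen one.
    rejected = lmm.generate(perturbed.detach(), prompt)
    chosen = lmm.generate(image, prompt)
    return chosen, rejected

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantages: normalize verified rewards within the group of
    # online responses sampled for the same prompt.
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)
```

Per the abstract, such offline chosen/rejected pairs are mixed with verified online rollouts, and GRPO is used to train on the offline-online hybrid data.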
Related papers
- When Relevance Meets Novelty: Dual-Stable Periodic Optimization for Exploratory Recommendation [6.663356205396985]
Large language models (LLMs) demonstrate potential with their diverse content generation capabilities.
Existing LLM-enhanced dual-model frameworks face two major limitations.
First, they overlook long-term preferences driven by group identity, leading to biased interest modeling.
Second, they suffer from static optimization flaws, as a one-time alignment process fails to leverage incremental user data for closed-loop optimization.
arXiv Detail & Related papers (2025-08-01T09:10:56Z)
- Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts.
It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.
MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z)
- Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing [14.114970711442512]
This paper introduces Attention Pruning, a fairness-aware simulated annealing approach to prune attention heads in large language models (LLMs).
Our experiments show that Attention Pruning achieves up to a 40% reduction in gender bias and outperforms the state-of-the-art bias mitigation strategies.
arXiv Detail & Related papers (2025-03-20T03:02:32Z)
- Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models [54.381650481255235]
We introduce a new tuning-free approach for self-alignment, Dynamic Rewarding with Prompt Optimization (DRPO).
Our approach leverages a search-based optimization framework that allows LLMs to iteratively self-improve and craft the optimal alignment instructions.
Empirical evaluations on eight recent LLMs, both open and closed-sourced, demonstrate that DRPO significantly enhances alignment performance.
arXiv Detail & Related papers (2024-11-13T16:15:38Z)
- Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization [9.618391485742968]
Iterative preference optimization has recently become one of the de-facto training paradigms for large language models (LLMs).
We present an uncertainty-enhanced Preference Optimization framework to make the LLM self-evolve with reliable feedback.
Our framework substantially alleviates the noisy problem and improves the performance of iterative preference optimization.
arXiv Detail & Related papers (2024-09-17T14:05:58Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
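For context on the DPO entry above: its closed-form reward parameterization reduces preference learning to a single classification-style loss on paired responses. Below is a minimal, standard sketch of that loss, where beta is the strength of the implicit KL regularization and the log-probability inputs are summed over response tokens; it is a generic illustration, not code taken from any of the listed papers.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of the policy to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```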