$φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
- URL: http://arxiv.org/abs/2602.22601v1
- Date: Thu, 26 Feb 2026 04:14:33 GMT
- Title: $φ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
- Authors: Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu
- Abstract summary: This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. We first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Extensive experiments and ablation studies show that the proposed $φ$-DPO achieves state-of-the-art performance across multiple benchmarks.
- Score: 58.217707070069885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused by imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $φ$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO on imbalanced data and present a new $φ$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $φ$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show that the proposed $φ$-DPO achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.
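Since $φ$-DPO builds on the standard DPO objective, a minimal sketch of that base loss may help make the abstract concrete. The fairness reweighting that defines the $φ$-DPO loss is not specified in the abstract, so the snippet below shows only the vanilla DPO objective (Rafailov et al., 2023); the function and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Vanilla DPO loss over a batch of pairwise preferences.

    Each *_logps tensor holds the summed log-probability of the chosen
    (preferred) or rejected response under the trainable policy or the
    frozen reference model. phi-DPO would additionally correct these
    terms for imbalanced data, but that modification is not given in
    the abstract, so it is omitted here.
    """
    # Implicit reward margins: scaled log-ratios of policy vs. reference
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood, maximized via the logistic loss
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

In a continual learning setting, the "chosen" and "rejected" log-probabilities would come from the pairwise preference annotations the paper constructs for existing benchmarks, with the reference model anchoring the policy to past-task behavior.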
Related papers
- Difficulty-Estimated Policy Optimization [38.86673795561421]
We propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling.
arXiv Detail & Related papers (2026-02-06T04:12:23Z) - What Is Preference Optimization Doing, How and Why? [73.78865043839939]
A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses of the reasons underlying these differences remain lacking. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO follows dynamic targets that balance exploration and exploitation. Second, we examine the roles of positive learning, negative learning, and loss reweighting, three key components of PO methods.
arXiv Detail & Related papers (2025-11-30T08:27:59Z) - Distribution Preference Optimization: A Fine-grained Perspective for LLM Unlearning [26.120338506874976]
Unlearning, which aims to remove the influence of specific data while preserving overall model utility, is becoming an important research area. We derive a novel unlearning algorithm termed Distribution Preference Optimization (DiPO). DiPO attains the highest forget quality on the TOFU benchmark, and maintains leading scalability and sustainability on the MUSE benchmark.
arXiv Detail & Related papers (2025-10-06T12:49:00Z) - Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining [75.14823970163685]
We propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs. MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases. It can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
arXiv Detail & Related papers (2025-05-20T03:59:05Z) - A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z) - Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)