Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
- URL: http://arxiv.org/abs/2601.23222v1
- Date: Fri, 30 Jan 2026 17:45:53 GMT
- Title: Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
- Authors: Hamza Kalisch, Constantin Seibold, Jens Kleesiek, Ken Herrmann, Frederic Jonske
- Abstract summary: Region-Normalized DPO is a segmentation-aware objective which normalizes preference updates by the size of the disagreement region between masks. It stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
- Score: 7.10111238784554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals like model agreement, uncertainty measures, or learned mask-quality scores which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge's top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective which normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
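The abstract describes the RN-DPO objective only at a high level, so the exact loss is not reproduced here. Below is a minimal, hypothetical PyTorch sketch of the stated idea, restricting the per-pair DPO margin to the pixels on which the two candidate masks disagree and dividing by the size of that region; every name in it (rn_dpo_loss, logp_win, mask_lose, beta) is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rn_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose,
                mask_win, mask_lose, beta=0.1, eps=1e-6):
    """Hypothetical sketch of a region-normalized DPO loss.

    logp_* / ref_logp_*: per-pixel log-likelihoods of the preferred and
        dispreferred masks under the policy and the frozen reference model,
        each of shape (B, H, W).
    mask_*: the corresponding binary candidate masks, shape (B, H, W).
    """
    # Disagreement region: pixels where the two candidate masks differ.
    disagree = (mask_win != mask_lose).float()
    region_size = disagree.sum(dim=(1, 2)).clamp(min=eps)

    # Policy-vs-reference log-ratio margin, accumulated only over the
    # disagreement region and normalized by its size.
    margin = ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)) * disagree
    margin = margin.sum(dim=(1, 2)) / region_size

    # Standard DPO logistic loss on the region-normalized margin.
    return -F.logsigmoid(beta * margin).mean()
```

Under this reading, a pair whose masks disagree on a large region contributes a margin on the same scale as a pair that differs on only a few pixels, which matches the abstract's claim that normalization reduces the leverage of harmful comparisons.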
Related papers
- YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation [56.35317441010461]
Yet another Policy Optimization (YaPO) is a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. We show that YaPO converges faster, achieves stronger performance, and exhibits improved training stability compared to dense steering baselines.
arXiv Detail & Related papers (2026-01-13T11:10:13Z) - AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment [25.526336903358757]
Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. We propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones (a rough sketch of this margin scheme appears after this list).
arXiv Detail & Related papers (2025-11-12T14:51:59Z) - Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models [38.27881260102189]
Diffusion-SDPO is a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks, and adds only marginal computational overhead.
arXiv Detail & Related papers (2025-11-05T09:30:49Z) - Lightweight Robust Direct Preference Optimization [26.99327564250612]
We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead.
arXiv Detail & Related papers (2025-10-27T17:55:06Z) - Adaptive Margin RLHF via Preference over Preferences [44.328333474444214]
We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision.
arXiv Detail & Related papers (2025-09-26T19:03:24Z) - On Symmetric Losses for Robust Policy Optimization with Noisy Preferences [55.8615920580824]
This work focuses on reward modeling, a core component in reinforcement learning from human feedback. We propose a principled framework for robust policy optimization under noisy preferences. We prove that symmetric losses enable successful policy optimization even under noisy labels.
arXiv Detail & Related papers (2025-05-30T15:30:43Z) - Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z) - Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Proposal Distribution Calibration for Few-Shot Object Detection [65.19808035019031]
In few-shot object detection (FSOD), the two-step training paradigm is widely adopted to mitigate the severe sample imbalance.
Unfortunately, the extreme data scarcity aggravates the proposal distribution bias, hindering the RoI head from evolving toward novel classes.
We introduce a simple yet effective proposal distribution calibration (PDC) approach to enhance the localization and classification abilities of the RoI head.
arXiv Detail & Related papers (2022-12-15T05:09:11Z)
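As flagged in the AMaPO entry above, its summary describes an instance-wise adaptive margin obtained by Z-normalizing and exponentially scaling the implicit reward margins. The snippet below is a speculative sketch of that mechanism under assumed definitions (per-sequence log-likelihoods of chosen and rejected responses, a DPO-style implicit reward margin); it is not the published AMaPO algorithm, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta=0.1, eps=1e-8):
    """Speculative sketch of an AMaPO-style adaptive-margin objective.

    All inputs are per-sequence log-likelihoods of shape (B,):
    chosen (w) and rejected (l) responses under the policy and the
    frozen reference model.
    """
    # Implicit DPO reward margin per instance.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # Z-normalize margins across the batch, then apply exponential scaling:
    # misranked samples (low or negative margin) get a large attached margin,
    # well-ranked samples a small one.
    z = (margin - margin.mean()) / (margin.std() + eps)
    adaptive_margin = torch.exp(-z).detach()

    # Margin-attached logistic loss; larger attached margins yield larger
    # gradients, reallocating learning effort toward misranked pairs.
    return -F.logsigmoid(margin - adaptive_margin).mean()
```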
This list is automatically generated from the titles and abstracts of the papers in this site.