Fugu-MT Paper Translation (Abstract): Preference Robustness for DPO with Applications to Public Health

Paper abstract: Preference Robustness for DPO with Applications to Public Health

arxiv url: http://arxiv.org/abs/2509.02709v1
Date: Tue, 02 Sep 2025 18:10:32 GMT
Status: Translation complete
System update date: 2025-09-04 21:40:46.29694
Title: Preference Robustness for DPO with Applications to Public Health
Title (reference translation): Preference Robustness of DPO and Its Application to Public Health
Authors: Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe
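Abstract summary: We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO). We evaluate DPO-PRO on a real-world maternal health program operated by the non-profit organization ARMMAN.
Reference score (independently computed attention measure): 26.99327564250612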
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. This setting presents a challenging testbed for alignment due to complex and ambiguous objectives and limited data availability. We propose DPO-PRO, a robust fine-tuning algorithm based on Direct Preference Optimization (DPO), which accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. We evaluate DPO-PRO on a real-world maternal mobile health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results demonstrate that our method consistently improves robustness to noisy preference signals compared to existing DPO variants. Moreover, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design, while requiring significantly lower inference-time cost.
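Abstract (reference translation): We study an LLM fine-tuning task for designing reward functions for sequential resource allocation problems in public health, guided by human preferences expressed in natural language. Because of complex, ambiguous objectives and limited data availability, this setting provides a challenging testbed for alignment. DPO-PRO is a robust fine-tuning algorithm based on DPO (Direct Preference Optimization) that accounts for uncertainty in the preference distribution using a lightweight Distributionally Robust Optimization (DRO) formulation. Unlike prior DRO-based DPO methods, DPO-PRO is significantly less conservative. DPO-PRO is evaluated on a real-world maternal health program operated by the non-profit organization ARMMAN, as well as on standard alignment benchmarks. Experimental results show that the proposed method consistently improves robustness to noisy preference signals compared with existing DPO variants. In addition, DPO-PRO achieves performance comparable to a prior self-reflection-based baseline for reward function design while significantly reducing inference-time cost.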
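
The abstract describes DPO-PRO only at a high level, so the following is a minimal sketch of the general idea rather than the authors' formulation. It shows the standard DPO margin, a cross-entropy loss against a soft preference label `q`, and a worst-case loss over a small interval around `q`. The interval ambiguity set, the radius `rho`, and all function names are assumptions made for illustration; the paper's actual DRO formulation may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO implicit reward margin:
    # beta * (policy log-ratio minus reference log-ratio).
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def soft_dpo_loss(margin: float, q: float) -> float:
    # Cross-entropy against a soft label q = P(first response is preferred).
    # With q = 1 this reduces to the usual DPO loss -log(sigmoid(margin)).
    return -(q * log_sigmoid(margin) + (1.0 - q) * log_sigmoid(-margin))


def robust_dpo_loss(margin: float, q: float, rho: float) -> float:
    # ASSUMPTION (illustrative only): the true preference probability lies in
    # [q - rho, q + rho], clipped to [0, 1]. The loss is linear in q, so the
    # worst case is attained at one of the two endpoints.
    lo = max(0.0, q - rho)
    hi = min(1.0, q + rho)
    return max(soft_dpo_loss(margin, lo), soft_dpo_loss(margin, hi))


if __name__ == "__main__":
    m = dpo_margin(-1.0, -2.0, -1.5, -1.5, beta=0.1)
    print("margin:", m)
    print("soft loss:", soft_dpo_loss(m, 0.9))
    print("robust loss:", robust_dpo_loss(m, 0.9, 0.1))
```

Setting `rho = 0` recovers the soft-label loss, and a larger `rho` makes the objective more pessimistic about the preference label, which is the trade-off the abstract refers to when it calls other DRO-based methods more conservative.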

Related papers