Fugu-MT Paper Translation (Abstract): Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

Paper abstract: Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

arxiv url: http://arxiv.org/abs/2605.30021v2
Date: Tue, 02 Jun 2026 18:07:48 GMT
Status: Translation complete
Last updated in system: 2026-06-04 17:40:41.562403
Title: Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
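Title (reference translation): Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs
Authors: Vinay Samuel, Yapei Chang, Mohit Iyyer
Abstract summary: We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes. For each prompt, REDIPO samples responses from both the base and instruct models, rewrites base-model responses with the instruct model, and filters candidates for safety and instruction-following quality. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44%, while DivPO changes diversity by 0%, -6%, and -4%.
Reference score (independently computed attention score): 26.527631359992125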
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely maintain MTBench, IFEval, and Arena-Hard performance, and reduce direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at https://github.com/vsamuel2003/ReDiPO.
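Abstract (reference translation): Many open-ended instructions have several valid answers that users would benefit from seeing, yet post-training often restricts an LLM's output space to a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline that recovers distinct valid answer modes while keeping the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both the base and instruct models, rewrites the base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, whereas DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely preserve MTBench, IFEval, and Arena-Hard performance and reduce the direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, the results indicate that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. Code and data are available at https://github.com/vsamuel2003/ReDiPO.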
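
The pair-construction step described in the abstract (filter candidates, then prefer the more marginally diverse response among candidates with similar instruction-following reward) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the diversity measure, the reward model, or any thresholds, so the word-level Jaccard distance, the `min_reward` and `max_reward_gap` values, and every name below (`Candidate`, `marginal_diversity`, `build_preference_pairs`) are assumptions made for illustration only. The sampling and rewriting steps are omitted; candidates are assumed to be already scored.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    text: str
    reward: float  # instruction-following reward from a hypothetical scorer
    is_safe: bool  # verdict of a hypothetical safety filter


def jaccard_distance(a: str, b: str) -> float:
    """Toy dissimilarity: 1 - word-set Jaccard overlap (a stand-in metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def marginal_diversity(index: int, pool: list[Candidate]) -> float:
    """Mean distance from one candidate to every other candidate in the pool."""
    others = [c for i, c in enumerate(pool) if i != index]
    if not others:
        return 0.0
    return sum(jaccard_distance(pool[index].text, o.text) for o in others) / len(others)


def build_preference_pairs(
    candidates: list[Candidate],
    min_reward: float = 0.5,      # assumed quality floor
    max_reward_gap: float = 0.1,  # assumed bound on reward difference within a pair
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) text pairs for one prompt."""
    # Filtering: keep only safe candidates above the quality floor.
    pool = [c for c in candidates if c.is_safe and c.reward >= min_reward]
    diversity = [marginal_diversity(i, pool) for i in range(len(pool))]

    pairs = []
    for i, j in combinations(range(len(pool)), 2):
        # Quality-bounded pairing: skip pairs whose rewards differ too much.
        if abs(pool[i].reward - pool[j].reward) > max_reward_gap:
            continue
        # A tie in diversity carries no preference signal.
        if diversity[i] == diversity[j]:
            continue
        chosen, rejected = (i, j) if diversity[i] > diversity[j] else (j, i)
        pairs.append((pool[chosen].text, pool[rejected].text))
    return pairs


# Made-up candidates for a single prompt.
candidates = [
    Candidate("try a beach holiday in spain", 0.80, True),
    Candidate("try a beach holiday in italy", 0.82, True),
    Candidate("go hiking across the alps", 0.78, True),
    Candidate("unsafe idea", 0.90, False),   # dropped by the safety filter
    Candidate("low quality", 0.20, True),    # dropped by the quality floor
]
pairs = build_preference_pairs(candidates)
for chosen, rejected in pairs:
    print(f"chosen={chosen!r} rejected={rejected!r}")
```

In this toy run the two near-identical beach answers are each rejected in favor of the hiking answer, which is the least similar to the rest of the pool while having comparable reward. The resulting (chosen, rejected) pairs have the shape that offline DPO training consumes.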

Related papers