Fugu-MT Paper Translation (Abstract): HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

Paper abstract: HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

arxiv url: http://arxiv.org/abs/2604.20140v1
Date: Wed, 22 Apr 2026 03:08:30 GMT
Status: Translation complete
System update date: 2026-04-23 15:36:10.94338
Title: HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
Title (reference translation): HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
Authors: Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan, Brennan Lagasse, Kevin Zhu,
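Abstract summary: We propose HiPO, an extension of DPO that splits responses into segments and computes the loss as a weighted sum of the DPO loss for each segment. The approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. For multiple 7B LLMs fine-tuned with HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common benchmarks.
Reference score (independently computed attention measure): 2.497936211748472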
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
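Abstract (reference translation): Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes the likelihood of generating the preferred response over the dispreferred one as a whole, and it lacks the granularity to give feedback on subsections of the multi-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants such as KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but do not combine these complementary strengths. This paper proposes HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes the loss as a weighted sum of the DPO loss for each segment. The approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. For multiple 7B LLMs fine-tuned with HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and show improved organization, logical flow, and consistency as measured by GPT-4.1.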
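
The abstract describes the HiPO loss only as a weighted sum of the DPO loss for each segment. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that each segment's DPO loss takes the standard form -log σ(β · margin) over that segment's summed token log-probabilities, and the names `SegmentLogProbs`, `dpo_loss`, `hipo_loss`, the segment keys, the weights, and `beta=0.1` are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SegmentLogProbs:
    """Summed token log-probabilities for one segment (hypothetical container)."""
    policy_chosen: float    # log pi_theta(chosen segment | prompt)
    policy_rejected: float  # log pi_theta(rejected segment | prompt)
    ref_chosen: float       # log pi_ref(chosen segment | prompt)
    ref_rejected: float     # log pi_ref(rejected segment | prompt)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(seg: SegmentLogProbs, beta: float = 0.1) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), applied to one segment."""
    margin = (seg.policy_chosen - seg.ref_chosen) - (seg.policy_rejected - seg.ref_rejected)
    # -log(sigmoid(z)) is equal to softplus(-z)
    return softplus(-beta * margin)


def hipo_loss(segments: dict, weights: dict, beta: float = 0.1) -> float:
    """Weighted sum of per-segment DPO losses (segment weights are assumed values)."""
    return sum(weights[name] * dpo_loss(seg, beta) for name, seg in segments.items())


# Assumed segment names, following the three segments listed in the abstract.
segments = {
    "query": SegmentLogProbs(-12.0, -12.5, -12.2, -12.4),
    "reasoning": SegmentLogProbs(-80.0, -95.0, -85.0, -90.0),
    "answer": SegmentLogProbs(-4.0, -9.0, -5.0, -7.0),
}
weights = {"query": 0.2, "reasoning": 0.5, "answer": 0.3}  # illustrative only
loss = hipo_loss(segments, weights)
```

With all weights placed on a single segment that spans the whole response, this sketch reduces to the ordinary DPO loss, which is consistent with the abstract's description of HiPO as an extension of DPO.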

Related papers