HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
- URL: http://arxiv.org/abs/2509.23967v1
- Date: Sun, 28 Sep 2025 16:46:12 GMT
- Title: HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
- Authors: Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng, Xinping Lei, Weihao Li, Jingxuan Xu, Kun Wu, Yifan Yao, Haoyang Huang, Huaixi Tang, Kepeng Lei, Zhiyi Lai, Songwei Yu, Zongxian Feng, Zuchen Gao, Weihao Xie, Chenchen Zhang, Yanan Wu, Yuanxing Zhang, Lecheng Huang, Yuqun Zhang, Jie Liu, Zhaoxiang Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu,
- Abstract summary: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy.
- Score: 54.16300997612526
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipeline, providing paired Think-on and Think-off responses, with a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO can serve as a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
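The abstract describes a reward that trades off accuracy against token cost while discouraging over-reliance on Think-on. The sketch below is a hypothetical illustration of that idea, not the paper's actual formulation: all function and parameter names (`hybrid_reward`, `length_budget`, `overthink_penalty`, `think_off_solves`) are assumptions chosen for clarity.

```python
# Hypothetical sketch of a HiPO-style hybrid reward: score correctness,
# penalize token cost, and discourage Think-on when the paired Think-off
# response already solves the prompt. Not the paper's exact reward.

def hybrid_reward(correct: bool, mode: str, num_tokens: int,
                  think_off_solves: bool, length_budget: int = 2048,
                  length_weight: float = 0.2,
                  overthink_penalty: float = 0.3) -> float:
    """Score one response (mode is "think_on" or "think_off") for RL training."""
    reward = 1.0 if correct else 0.0
    # Efficiency term: longer generations give up a share of the reward.
    reward -= length_weight * min(num_tokens / length_budget, 1.0)
    # Anti-over-reliance term: if a direct answer suffices, extra reasoning
    # is penalized, so the policy learns when thinking is unnecessary.
    if mode == "think_on" and think_off_solves:
        reward -= overthink_penalty
    return reward

# On a prompt a direct answer already handles, a correct Think-off response
# outranks a correct but verbose Think-on trace:
direct = hybrid_reward(True, "think_off", 120, think_off_solves=True)
verbose = hybrid_reward(True, "think_on", 1500, think_off_solves=True)
assert direct > verbose
```

Under this toy scoring, the policy is pushed toward Think-off on easy prompts while still being rewarded for Think-on whenever detailed reasoning is what makes the answer correct.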
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning [47.55414301744048]
We argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. We propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information. IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods.
arXiv Detail & Related papers (2026-02-22T05:30:14Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization [5.674809920704963]
Latent Thought Policy Optimization enhances LLM reasoning entirely at test time. Experiments show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements.
arXiv Detail & Related papers (2025-10-05T12:50:39Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2× and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z) - Offline Reinforcement Learning for LLM Multi-Step Reasoning [15.687002884103537]
OREO (Offline Reasoning Optimization) is an offline reinforcement learning method for enhancing multi-step reasoning. It reduces the need to collect pairwise data and enables better credit assignment. It surpasses existing offline learning methods on multi-step reasoning benchmarks.
arXiv Detail & Related papers (2024-12-20T18:49:45Z) - AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
α-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preference is a paradigm used in the large-scale language model (LLM) fine-tuning step to better align a pretrained LLM with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of β in DPO, disclose the syntax difference between the RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z) - ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization [24.55845271377532]
Large Language Models rely on Human Preference Alignment to ensure the generation of safe content.
We propose a novel approach called In-Context Direct Preference Optimization (ICDPO)
ICDPO generates well-aligned responses as estimated by the aforementioned instant scorer, thereby enhancing the final performance.
arXiv Detail & Related papers (2024-02-14T17:14:34Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.