IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
- URL: http://arxiv.org/abs/2602.19049v1
- Date: Sun, 22 Feb 2026 05:30:14 GMT
- Title: IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
- Authors: Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang, Qi Guo, Jundong Li
- Abstract summary: We argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. We propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information. IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods.
- Score: 47.55414301744048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge the gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that our IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
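To make the mechanism concrete, here is a minimal, hedged sketch of information-aware advantage shaping. Each token is scored with a pointwise proxy for its conditional MI with the final answer (the change in the answer's log-probability when the token is appended), and the sequence-level advantage is then redistributed across tokens. The proxy, the shaping rule, the `beta` coefficient, and `answer_logprob` are illustrative assumptions, not IAPO's actual estimator; see the paper and the linked repository for the real formulation.

```python
# Hypothetical sketch of token-wise advantage shaping from a pointwise
# conditional-MI proxy. `answer_logprob` stands in for a scorer returning
# log p(answer | reasoning prefix) under the policy model; it is NOT the
# IAPO release -- substitute a real model call.
import numpy as np

def answer_logprob(prefix_tokens, answer):
    # Toy stand-in: reward prefixes that mention the answer (illustration only).
    hits = sum(tok == answer for tok in prefix_tokens)
    return -5.0 + 2.0 * hits  # pretend log-probability

def token_mi_proxy(tokens, answer):
    """Pointwise proxy for I(token_t ; answer | prefix_{<t}): the change
    in the answer's log-probability when token t is appended."""
    scores = []
    for t in range(len(tokens)):
        with_t = answer_logprob(tokens[: t + 1], answer)
        without_t = answer_logprob(tokens[:t], answer)
        scores.append(with_t - without_t)
    return np.array(scores)

def shaped_advantages(tokens, answer, seq_advantage, beta=0.5):
    """Redistribute a sequence-level advantage across tokens so that
    high-information tokens are reinforced and low-utility ones damped."""
    mi = token_mi_proxy(tokens, answer)
    mi_centered = mi - mi.mean()  # zero-mean, so shaping only reallocates credit
    return seq_advantage + beta * mi_centered

if __name__ == "__main__":
    tokens = ["let", "x", "=", "4", "so", "answer", "is", "4"]
    for tok, a in zip(tokens, shaped_advantages(tokens, "4", seq_advantage=1.0)):
        print(f"{tok:>8s}  advantage={a:+.2f}")
```

Because the MI scores are centered before shaping, the mean token advantage stays equal to the sequence-level advantage: shaping reallocates credit toward informative tokens rather than changing the overall reward signal.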
Related papers
- ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks. We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
- In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
- Distribution Preference Optimization: A Fine-grained Perspective for LLM Unlearning [26.120338506874976]
Unlearning, which aims to remove the influence of specific data while preserving overall model utility, is becoming an important research area. We derive a novel unlearning algorithm termed Distribution Preference Optimization (DiPO). DiPO attains the highest forget quality on the TOFU benchmark, and maintains leading scalability and sustainability on the MUSE benchmark.
arXiv Detail & Related papers (2025-10-06T12:49:00Z)
- HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs [54.16300997612526]
Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy.
arXiv Detail & Related papers (2025-09-28T16:46:12Z)
- Do Thinking Tokens Help or Trap? Towards More Efficient Large Reasoning Model [7.8354921036790275]
Large Reasoning Models (LRMs) excel at solving complex problems but face an overthinking dilemma. When handling simple tasks, they often produce verbose responses overloaded with thinking tokens. These tokens trigger unnecessary high-level reasoning behaviors like reflection and backtracking, reducing efficiency.
arXiv Detail & Related papers (2025-06-30T13:30:33Z)
- IGD: Token Decisiveness Modeling via Information Gain in LLMs for Personalized Recommendation [79.22388408461458]
We introduce an Information Gain-based Decisiveness-aware Token handling (IGD) strategy that integrates token decisiveness into both tuning and decoding. IGD consistently improves recommendation accuracy, achieving significant gains on widely used ranking metrics compared to strong baselines (a minimal sketch of the information-gain idea follows this entry).
arXiv Detail & Related papers (2025-06-16T08:28:19Z)
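The core quantity in the IGD entry above is easy to state: a token is decisive to the extent that generating it narrows the model's belief over candidate items. A minimal sketch, where the candidate distributions and the weighting rule are illustrative assumptions rather than IGD's actual recipe:

```python
# Toy illustration of information-gain token weighting: decisiveness is
# measured as the entropy drop over candidate items after a token is
# generated; decisive tokens get larger weight in tuning/decoding.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + 1e-12))

def token_information_gain(dist_before, dist_after):
    """Decisiveness = how much the token narrows the belief over items."""
    return entropy(dist_before) - entropy(dist_after)

# Belief over 4 candidate items before/after generating one title token.
before = [0.25, 0.25, 0.25, 0.25]
after = [0.70, 0.10, 0.10, 0.10]
ig = token_information_gain(before, after)
weight = max(ig, 0.0)  # indecisive tokens are down-weighted
print(f"information gain = {ig:.3f}, token weight = {weight:.3f}")
```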
- Think or Not? Exploring Thinking Efficiency in Large Reasoning Models via an Information-Theoretic Lens [51.90059610606049]
This paper revisits the efficiency of long reasoning processes through an information-theoretic lens. We propose two metrics, InfoBias and InfoGain, to quantify divergence from ideal reasoning paths and stepwise information contribution. Motivated by these findings, we introduce an entropy-based Adaptive Think strategy that dynamically halts reasoning once confidence is sufficiently high (sketched after this entry).
arXiv Detail & Related papers (2025-05-23T13:38:56Z)
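The Adaptive Think strategy above admits a compact sketch: keep generating reasoning steps, but halt as soon as the entropy of the model's answer distribution falls below a threshold. The step interface, the toy probabilities, and the threshold are assumptions for illustration, not the paper's exact stopping rule.

```python
# Hedged sketch of an entropy-gated early stop: stop the chain of thought
# once the model is confident enough about the answer.
import math

def answer_entropy(answer_probs):
    return -sum(p * math.log(p) for p in answer_probs if p > 0)

def adaptive_think(step_fn, max_steps=16, entropy_threshold=0.3):
    """step_fn() -> (reasoning_step_text, answer_probs). Stops early once
    entropy over candidate answers falls below the threshold."""
    steps = []
    for _ in range(max_steps):
        text, probs = step_fn()
        steps.append(text)
        if answer_entropy(probs) < entropy_threshold:
            break  # confident enough: stop thinking
    return steps

# Toy demo: confidence sharpens with each step, so reasoning halts at step 3.
demo = iter([
    ("consider cases", [0.4, 0.3, 0.3]),
    ("eliminate case B", [0.7, 0.2, 0.1]),
    ("verify case A", [0.95, 0.03, 0.02]),
])
print(adaptive_think(lambda: next(demo)))
```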
- Efficient Inference for Large Reasoning Models: A Survey [74.17203483365171]
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. This survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving reasoning quality.
arXiv Detail & Related papers (2025-03-29T13:27:46Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods (a toy sketch of the prefix idea follows this entry).
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
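Reading the UPFT entry above literally, the prefix idea can be sketched in a few lines: rather than training on full labeled traces, keep only the first few tokens of the model's own sampled reasoning as fine-tuning targets. Tokenization, sampling, and the training step are all stand-ins here.

```python
# Hedged sketch of the prefix idea behind UPFT: fine-tune on the first
# few tokens of self-generated reasoning, with no answer labels and no
# rejection sampling. Whitespace "tokens" stand in for real tokenization.
def make_prefix_examples(sampled_traces, k=8):
    """Keep only the first k reasoning tokens of each sampled trace."""
    return [" ".join(trace.split()[:k]) for trace in sampled_traces]

traces = [
    "First note that the triangle is isosceles so the base angles match ...",
    "Rewrite the equation as a quadratic in x and apply the formula ...",
]
for prefix in make_prefix_examples(traces):
    print(prefix)  # each prefix becomes an unsupervised fine-tuning target
```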
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy (a rollout-sampling sketch follows this entry).
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
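The rollout-sampling idea in the last entry can likewise be sketched: branch many continuations just before and just after each token, and flag tokens whose inclusion sharply lowers the downstream success rate. `rollout_success_rate` is a toy stand-in for sampling real continuations and checking the final answer, and the simple gap test simplifies the paper's contrastive estimation.

```python
# Illustrative critical-token search via rollout sampling: a token is
# critical if rollouts from the prefix including it succeed far less
# often than rollouts branching just before it.
import random

random.seed(0)

def rollout_success_rate(prefix, n_rollouts=32):
    # Toy stand-in: pretend the token "31" derails the trajectory.
    base = 0.2 if "31" in prefix else 0.8
    return sum(random.random() < base for _ in range(n_rollouts)) / n_rollouts

def find_critical_tokens(tokens, gap=0.3):
    critical = []
    for t in range(len(tokens)):
        with_tok = rollout_success_rate(tokens[: t + 1])
        without_tok = rollout_success_rate(tokens[:t])
        if without_tok - with_tok > gap:  # token sharply hurts outcomes
            critical.append((t, tokens[t]))
    return critical

trajectory = ["total", "=", "31", "so", "answer", "=", "31"]
print(find_critical_tokens(trajectory))  # flags the first erroneous "31"
```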