Related papers: Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

URL: http://arxiv.org/abs/2510.11104v1
Date: Mon, 13 Oct 2025 07:51:16 GMT
Title: Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization
Authors: Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Min Xu,
Abstract summary: We propose Confidence-Guided Reasoning Path Preference Optimization (CGPO)<n>CGPO applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift.<n>We show that our method can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.
Score: 40.8414358896996
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.

Related papers

Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification [2.503562746177713]
We develop a framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages.<n>Building on this insight, we leverage empirical scaling laws to develop a data-driven method for optimally splitting samples between the fine-tuning and rectification stages.<n> Empirical analysis validates our framework, demonstrating improved estimation and inference performance compared to using either fine-tuning or rectification alone.
arXiv Detail & Related papers (2025-11-23T05:23:21Z)
ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework.<n>It reframes the learning task to predict the residual deviation from an inertial reference.<n>On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z)
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models [90.45197506653341]
Large reasoning models generate intermediate reasoning traces before producing final answers.<n> aligning LRMs with human preferences, a crucial prerequisite for model deployment, remains underexplored.<n>A common workaround optimized a single sampled trajectory, which introduces substantial gradient variance from trace sampling.
arXiv Detail & Related papers (2025-10-06T17:58:01Z)
Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence.<n>DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling [0.0]
This paper introduces a novel framework based on intuitionistic fuzzy sets (IFS) for modeling and aggregating human preferences in large language models (LLMs)<n>Our approach captures not only the degree of preference but also the uncertainty and hesitation inherent in human judgment through membership, non-membership, and hesitation degrees.<n> Experimental validation on multiple datasets demonstrates that our IFS-based approach significantly improves annotation consistency, reduces annotator fatigue, and produces higher-quality preference data.
arXiv Detail & Related papers (2025-05-30T04:20:00Z)
Dissecting the Impact of Model Misspecification in Data-driven Optimization [20.35205476800932]
Data-driven optimization aims to translate a machine learning model into decision-making by optimizing decisions on estimated costs.<n>A more recent approach uses estimation-optimization integration that minimizes decision error instead of estimation error.<n>We show how the integrated approach offers a universal double benefit'' on the top two dominating terms of regret when the underlying model is misspecified.
arXiv Detail & Related papers (2025-03-01T21:31:54Z)
Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning [5.487210426671288]
In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training. We also show that the conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization.
arXiv Detail & Related papers (2024-07-25T17:59:16Z)
The Hitchhiker's Guide to Human Alignment with *PO [43.4130314879284]
We focus on identifying the algorithm that, while being performant, is simultaneously more robust to varying hyper parameters. Our analysis reveals that the widely adopted DPO method consistently produces lengthy responses of inferior quality. Motivated by these findings, we propose an embarrassingly simple extension to the DPO algorithm, LN-DPO, resulting in more concise responses without sacrificing quality.
arXiv Detail & Related papers (2024-07-21T17:35:20Z)
Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs. We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase. We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories. We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.