Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2511.02130v1
- Date: Mon, 03 Nov 2025 23:47:49 GMT
- Title: Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
- Authors: Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto
- Abstract summary: We propose Re-FORC, an adaptive reward prediction method. It enables prediction of the expected future rewards as a function of the number of future thinking tokens.
- Score: 85.76121000710522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking-length selection that achieves 4% higher accuracy at equal compute and uses 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
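The mechanism described in the abstract is concrete enough to sketch. Below is a minimal illustration, not the authors' implementation: a lightweight head maps the model's hidden state to an expected reward at several candidate thinking-token budgets, and generation stops early when no budget's predicted gain beats its token cost. All names (`RewardHorizonHead`, `should_stop`, `lambda_cost`, the budget grid) are illustrative assumptions.

```python
# Hedged sketch of Re-FORC-style reward prediction and early stopping.
# Assumption: a small adapter reads the current hidden state and outputs
# E[reward | context, n more thinking tokens] for a fixed budget grid.
import torch
import torch.nn as nn

BUDGETS = [256, 512, 1024, 2048]  # candidate numbers of future thinking tokens

class RewardHorizonHead(nn.Module):
    """Lightweight adapter: hidden state -> expected reward per budget."""
    def __init__(self, hidden_dim: int, n_budgets: int = len(BUDGETS)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, n_budgets)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Rewards treated as success probabilities, hence the sigmoid.
        return torch.sigmoid(self.mlp(h))

def should_stop(pred_rewards: torch.Tensor, current_reward: float,
                lambda_cost: float = 1e-4) -> bool:
    """Stop when no budget's predicted reward gain exceeds its token cost."""
    gains = pred_rewards - current_reward                     # expected improvement
    costs = lambda_cost * torch.tensor(BUDGETS, dtype=torch.float32)
    return bool((gains - costs).max() <= 0.0)
```

With such a head, the abstract's three uses reduce to the same comparison: early stopping (stop when `should_stop` fires), model and thinking-length selection (pick the cheapest budget whose predicted reward clears a target), and cost-per-token length control (vary `lambda_cost`).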
Related papers
- Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z)
- Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty [42.57318973226598]
ARLCP is a reinforcement learning framework designed to balance reasoning efficiency and solution accuracy. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models.
arXiv Detail & Related papers (2026-02-12T16:04:00Z)
- Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning [11.179446105672461]
We propose a multi-stage efficient reasoning method that combines supervised fine-tuning and reinforcement learning. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models. It achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods.
arXiv Detail & Related papers (2026-01-06T12:31:51Z)
- d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models [45.27333046908981]
d-TreeRPO is a reliable reinforcement learning framework for dLLMs. We show that d-TreeRPO achieves significant gains on multiple reasoning benchmarks.
arXiv Detail & Related papers (2025-12-10T14:20:07Z)
- DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines.
arXiv Detail & Related papers (2025-10-16T20:05:57Z)
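The DLER entry above names its ingredients explicitly, so two of them can be sketched. The constants and exact formulas below are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch of two DLER ingredients: a truncation length penalty and
# batch-wise reward normalization. max_len and eps are illustrative.
import numpy as np

def shaped_rewards(rewards: np.ndarray, lengths: np.ndarray,
                   max_len: int = 4096, eps: float = 1e-8) -> np.ndarray:
    # Truncation penalty: zero out the reward of any response hitting the cap.
    r = np.where(lengths >= max_len, 0.0, rewards)
    # Batch-wise normalization: standardize rewards within the sampled batch.
    return (r - r.mean()) / (r.std() + eps)
```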
- Compute-Optimal Quantization-Aware Training [50.98555000360485]
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy. We investigate how different QAT durations impact final performance.
arXiv Detail & Related papers (2025-09-26T21:09:54Z)
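The FP-then-QAT decomposition in the entry above can be illustrated with a toy fake-quantization layer. The straight-through estimator and the idea of flipping `quantize` on partway through training are generic QAT practice, not this paper's specific protocol; the phase split it studies corresponds to how long `quantize` stays `False` versus `True`.

```python
# Hedged sketch of a two-phase FP -> QAT schedule with fake quantization.
import torch
import torch.nn as nn

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Uniform fake quantization with a straight-through gradient estimator.
    scale = x.detach().abs().max() / (2 ** (bits - 1) - 1) + 1e-8
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return x + (q - x).detach()  # forward pass: quantized; backward: identity

class QatLinear(nn.Linear):
    quantize = False  # phase 1 (FP): False; phase 2 (QAT): set to True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fake_quant(self.weight) if self.quantize else self.weight
        return nn.functional.linear(x, w, self.bias)
```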
- Reinforcement Pre-Training [78.5355979575498]
We introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
arXiv Detail & Related papers (2025-06-09T17:59:53Z)
- Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase output length due to overthinking. We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns. Our method achieves up to a 12% accuracy improvement and reduces token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
- Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression [0.0]
This paper presents a new hybrid model for predicting German electricity prices. The algorithm is based on a combination of Gaussian Process Regression (GPR) and Support Vector Regression (SVR).
arXiv Detail & Related papers (2024-11-28T10:32:50Z)
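Both components named in the entry above are available in scikit-learn, so the hybrid is easy to sketch. The particular kernel sum and the 50/50 blend of the two predictions are assumptions; the paper's actual combination scheme may differ.

```python
# Hedged sketch of a multi-kernel GPR + SVR hybrid (scikit-learn).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, WhiteKernel
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 5))                      # stand-in features (load, wind, ...)
y = 50 + 10 * X[:, 0] + rng.normal(size=200)  # stand-in prices (EUR/MWh)

# Multi-kernel GPR: sum of a smooth, a multi-scale, and a noise kernel.
gpr = GaussianProcessRegressor(kernel=RBF() + RationalQuadratic() + WhiteKernel())
svr = SVR(kernel="rbf", C=10.0)
gpr.fit(X, y)
svr.fit(X, y)

blend = 0.5 * gpr.predict(X) + 0.5 * svr.predict(X)  # simple average of the two
```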
- Adaptive Basis Function Selection for Computationally Efficient Predictions [2.1499203845437216]
We develop a method to automatically select the most important basis functions (BFs) for prediction in a sub-domain of the model domain. This significantly reduces the computational complexity of computing predictions while maintaining predictive accuracy.
arXiv Detail & Related papers (2024-08-14T11:53:18Z)
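The selection idea in the entry above is simple enough to illustrate: score each basis function by how much it contributes at the query location and keep only the top few. The RBF basis and the scoring rule below are assumptions about one plausible instantiation, not the paper's method.

```python
# Hedged sketch of adaptive basis-function (BF) selection for prediction.
import numpy as np

def predict_with_top_bfs(x_query: float, centers: np.ndarray,
                         weights: np.ndarray, lengthscale: float = 1.0,
                         k: int = 10) -> float:
    # RBF basis functions evaluated at the query point.
    phi = np.exp(-0.5 * ((x_query - centers) / lengthscale) ** 2)
    # Importance score: weight magnitude times local activation.
    scores = np.abs(weights) * phi
    top = np.argsort(scores)[-k:]          # keep the k most important BFs
    return float(weights[top] @ phi[top])  # predict from the subset only
```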
- Uncertainty-Aware Time-to-Event Prediction using Deep Kernel Accelerated Failure Time Models [11.171712535005357]
We propose Deep Kernel Accelerated Failure Time models for the time-to-event prediction task.
Our model shows better point estimate performance than recurrent neural network based baselines in experiments on two real-world datasets.
arXiv Detail & Related papers (2021-07-26T14:55:02Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.