Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2511.02130v1
- Date: Mon, 03 Nov 2025 23:47:49 GMT
- Title: Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
- Authors: Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, Stefano Soatto
- Abstract summary: We propose Re-FORC, an adaptive reward prediction method. It enables prediction of the expected future rewards as a function of the number of future thinking tokens.
- Score: 85.76121000710522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Re-FORC, an adaptive reward prediction method that, given a context, predicts the expected future reward as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking-length selection that achieves 4% higher accuracy at equal compute and uses 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in the high-compute regime and 7% in the low-compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
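The mechanism described in the abstract is concrete enough to sketch. Below is a minimal illustration, not the authors' implementation: a lightweight head maps the model's hidden state to an expected reward at several candidate thinking-token budgets, and generation stops early when no budget's predicted gain beats its token cost. All names (`RewardHorizonHead`, `should_stop`, `lambda_cost`, the budget grid) are illustrative assumptions.

```python
# Hedged sketch of Re-FORC-style reward prediction and early stopping.
# Assumption: a small adapter reads the current hidden state and outputs
# E[reward | context, n more thinking tokens] for a fixed budget grid.
import torch
import torch.nn as nn

BUDGETS = [256, 512, 1024, 2048]  # candidate numbers of future thinking tokens

class RewardHorizonHead(nn.Module):
    """Lightweight adapter: hidden state -> expected reward per budget."""
    def __init__(self, hidden_dim: int, n_budgets: int = len(BUDGETS)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, n_budgets)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Rewards treated as success probabilities, hence the sigmoid.
        return torch.sigmoid(self.mlp(h))

def should_stop(pred_rewards: torch.Tensor, current_reward: float,
                lambda_cost: float = 1e-4) -> bool:
    """Stop when no budget's predicted reward gain exceeds its token cost."""
    gains = pred_rewards - current_reward                     # expected improvement
    costs = lambda_cost * torch.tensor(BUDGETS, dtype=torch.float32)
    return bool((gains - costs).max() <= 0.0)
```

With such a head, the abstract's three uses reduce to the same comparison: early stopping (stop when `should_stop` fires), model and thinking-length selection (pick the cheapest budget whose predicted reward clears a target), and cost-per-token length control (vary `lambda_cost`).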
Related papers
- Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z)
- Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty [42.57318973226598]
ARLCP is a reinforcement learning framework designed to balance reasoning efficiency and solution accuracy. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models.
arXiv Detail & Related papers (2026-02-12T16:04:00Z)
- Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning [11.179446105672461]
We propose a multi-stage efficient reasoning method that combines supervised fine-tuning and reinforcement learning. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models. It achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods.
arXiv Detail & Related papers (2026-01-06T12:31:51Z)
- d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models [45.27333046908981]
d-TreeRPO is a reliable reinforcement learning framework for dLLMs. We show that d-TreeRPO achieves significant gains on multiple reasoning benchmarks.
arXiv Detail & Related papers (2025-12-10T14:20:07Z)
- DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all previous baselines.
arXiv Detail & Related papers (2025-10-16T20:05:57Z)
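The DLER entry above names its ingredients explicitly, so two of them can be sketched. The constants and exact formulas below are assumptions for illustration, not taken from the paper.

```python
# Hedged sketch of two DLER ingredients: a truncation length penalty and
# batch-wise reward normalization. max_len and eps are illustrative.
import numpy as np

def shaped_rewards(rewards: np.ndarray, lengths: np.ndarray,
                   max_len: int = 4096, eps: float = 1e-8) -> np.ndarray:
    # Truncation penalty: zero out the reward of any response hitting the cap.
    r = np.where(lengths >= max_len, 0.0, rewards)
    # Batch-wise normalization: standardize rewards within the sampled batch.
    return (r - r.mean()) / (r.std() + eps)
```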
- Compute-Optimal Quantization-Aware Training [50.98555000360485]
Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy. We investigate how different QAT durations impact final performance.
arXiv Detail & Related papers (2025-09-26T21:09:54Z)
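The FP-then-QAT decomposition in the entry above can be illustrated with a toy fake-quantization layer. The straight-through estimator and the idea of flipping `quantize` on partway through training are generic QAT practice, not this paper's specific protocol; the phase split it studies corresponds to how long `quantize` stays `False` versus `True`.

```python
# Hedged sketch of a two-phase FP -> QAT schedule with fake quantization.
import torch
import torch.nn as nn

def fake_quant(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Uniform fake quantization with a straight-through gradient estimator.
    scale = x.detach().abs().max() / (2 ** (bits - 1) - 1) + 1e-8
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    return x + (q - x).detach()  # forward pass: quantized; backward: identity

class QatLinear(nn.Linear):
    quantize = False  # phase 1 (FP): False; phase 2 (QAT): set to True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = fake_quant(self.weight) if self.quantize else self.weight
        return nn.functional.linear(x, w, self.bias)
```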
- Reinforcement Pre-Training [78.5355979575498]
We introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
arXiv Detail & Related papers (2025-06-09T17:59:53Z)
- Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase output length due to overthinking. We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns. Our method achieves up to a 12% accuracy improvement and reduces token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
- Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression [0.0]
This paper presents a new hybrid model for predicting German electricity prices. The algorithm is based on a combination of Gaussian Process Regression (GPR) and Support Vector Regression (SVR).
arXiv Detail & Related papers (2024-11-28T10:32:50Z)
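Both components named in the entry above are available in scikit-learn, so the hybrid is easy to sketch. The particular kernel sum and the 50/50 blend of the two predictions are assumptions; the paper's actual combination scheme may differ.

```python
# Hedged sketch of a multi-kernel GPR + SVR hybrid (scikit-learn).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, WhiteKernel
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 5))                      # stand-in features (load, wind, ...)
y = 50 + 10 * X[:, 0] + rng.normal(size=200)  # stand-in prices (EUR/MWh)

# Multi-kernel GPR: sum of a smooth, a multi-scale, and a noise kernel.
gpr = GaussianProcessRegressor(kernel=RBF() + RationalQuadratic() + WhiteKernel())
svr = SVR(kernel="rbf", C=10.0)
gpr.fit(X, y)
svr.fit(X, y)

blend = 0.5 * gpr.predict(X) + 0.5 * svr.predict(X)  # simple average of the two
```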
- Adaptive Basis Function Selection for Computationally Efficient Predictions [2.1499203845437216]
We develop a method to automatically select the most important basis functions (BFs) for prediction in a sub-domain of the model domain. This significantly reduces the computational complexity of computing predictions while maintaining predictive accuracy.
arXiv Detail & Related papers (2024-08-14T11:53:18Z)
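The selection idea in the entry above is simple enough to illustrate: score each basis function by how much it contributes at the query location and keep only the top few. The RBF basis and the scoring rule below are assumptions about one plausible instantiation, not the paper's method.

```python
# Hedged sketch of adaptive basis-function (BF) selection for prediction.
import numpy as np

def predict_with_top_bfs(x_query: float, centers: np.ndarray,
                         weights: np.ndarray, lengthscale: float = 1.0,
                         k: int = 10) -> float:
    # RBF basis functions evaluated at the query point.
    phi = np.exp(-0.5 * ((x_query - centers) / lengthscale) ** 2)
    # Importance score: weight magnitude times local activation.
    scores = np.abs(weights) * phi
    top = np.argsort(scores)[-k:]          # keep the k most important BFs
    return float(weights[top] @ phi[top])  # predict from the subset only
```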
- Uncertainty-Aware Time-to-Event Prediction using Deep Kernel Accelerated Failure Time Models [11.171712535005357]
We propose Deep Kernel Accelerated Failure Time models for the time-to-event prediction task.
Our model shows better point estimate performance than recurrent neural network based baselines in experiments on two real-world datasets.
arXiv Detail & Related papers (2021-07-26T14:55:02Z)
- APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [49.3037538647714]
We present APQ for efficient deep learning inference on resource-constrained hardware.
Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner.
With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ.
arXiv Detail & Related papers (2020-06-15T16:09:17Z)
This list is automatically generated from the titles and abstracts of papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.