CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models
- URL: http://arxiv.org/abs/2511.06430v1
- Date: Sun, 09 Nov 2025 15:51:52 GMT
- Title: CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models
- Authors: Peyman Hosseini, Ondrej Bohdal, Taha Ceritli, Ignacio Castro, Matthew Purver, Mete Ozay, Umberto Michieli
- Abstract summary: Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models to complex tasks at test time. We propose context-guided TTRL (CG-TTRL), which integrates context dynamically into both sampling phases, and propose a method for efficient context selection for on-device applications.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models to complex tasks at test time, yielding large performance improvements. TTRL leverages an elegant two-phase sampling strategy: first, multi-sampling derives a pseudo-label via majority voting; then downsampling and reward-based fine-tuning encourage the model to explore and learn diverse valid solutions, with the pseudo-label modulating the reward signal. Meanwhile, in-context learning has been widely explored at inference time and has been shown to enhance model performance without weight updates. However, TTRL's two-phase sampling strategy under-utilizes contextual guidance, which could improve pseudo-label accuracy in the initial exploitation phase while regulating exploration in the second. To address this, we propose context-guided TTRL (CG-TTRL), which integrates context dynamically into both sampling phases, and we propose a method for efficient context selection for on-device applications. Our evaluations on mathematical and scientific QA benchmarks show that CG-TTRL outperforms TTRL (e.g., a further 7% relative accuracy improvement), while boosting efficiency by reaching strong performance after only a few steps of test-time training (e.g., an 8% relative improvement after 3 steps, versus 1% for TTRL).
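To make the two-phase strategy concrete, below is a minimal sketch in plain Python of the loop the abstract describes: majority voting over sampled answers yields a pseudo-label, which then gates the rewards for a downsampled set of rollouts. The helper names, the binary reward, and the similarity-based context selector are illustrative assumptions; the abstract does not specify CG-TTRL's actual selection criterion or RL objective.

```python
# A minimal, hypothetical sketch of the TTRL two-phase loop with context
# guidance, written to mirror the abstract above. Function names and the
# dot-product context selector are assumptions, not the authors' code.
from collections import Counter

def majority_vote_pseudo_label(answers):
    """Phase 1 (exploitation): take the majority answer among N samples
    as the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]

def pseudo_label_rewards(rollout_answers, pseudo_label):
    """Phase 2 (exploration): score downsampled rollouts against the
    pseudo-label; these rewards would drive the test-time RL update."""
    return [1.0 if a == pseudo_label else 0.0 for a in rollout_answers]

def select_context(query_emb, exemplar_embs, k=2):
    """Hypothetical on-device context selection: rank candidate exemplars
    by dot-product similarity to the query (embeddings pre-normalized)
    and keep the top k to prepend to the prompt in both phases."""
    scores = [(sum(q * e for q, e in zip(query_emb, emb)), i)
              for i, emb in enumerate(exemplar_embs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy run: 4 samples vote for a pseudo-label, then 2 rollouts are rewarded.
answers = ["42", "42", "41", "42"]
y_hat = majority_vote_pseudo_label(answers)         # -> "42"
rewards = pseudo_label_rewards(answers[:2], y_hat)  # -> [1.0, 1.0]
```

In this reading, the selected exemplars would be prepended to the prompt in both sampling phases, which is where the abstract's "contextual guidance" enters.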
Related papers
- SWE-RM: Execution-free Feedback For Software Engineering Agents [61.86380395896069]
Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. We introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
arXiv Detail & Related papers (2025-12-26T08:26:18Z)
- CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z)
- Harnessing the Power of Reinforcement Learning for Adaptive MCMC [6.313580378481795]
Reinforcement Learning Metropolis-Hastings (RLMH) frames adaptive MCMC as a Markov decision process. This paper shows that natural choices of reward, such as the acceptance rate or the expected squared jump distance, provide insufficient signal for training RLMH. We present adaptive gradient-based samplers that balance flexibility of the Markov transition kernel with learnability of the associated RL task.
arXiv Detail & Related papers (2025-07-01T11:12:34Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- TTRL: Test-Time Reinforcement Learning [31.351608137721875]
Test-Time Reinforcement Learning (TTRL) is a novel method for training Large Language Models (LLMs) on unlabeled data. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models.
arXiv Detail & Related papers (2025-04-22T17:59:56Z)
- Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization [83.65278205301576]
We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which minimizes the difference among samples. Experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency.
arXiv Detail & Related papers (2025-02-05T07:13:43Z)
- LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models [23.218237408724676]
We propose LoRA-TTT, a novel Test-Time Training (TTT) method for vision-language models (VLMs). By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach. Our method can adapt to diverse domains by combining two losses, without increasing memory consumption or runtime.
arXiv Detail & Related papers (2025-02-04T07:40:26Z)
- Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)