Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
- URL: http://arxiv.org/abs/2602.07670v1
- Date: Sat, 07 Feb 2026 19:29:07 GMT
- Title: Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
- Authors: Jarrod Barnes
- Abstract summary: We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided top-3 selection matches oracle performance at 100%; this zero-cost strategy holds up under length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
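The core idea in the abstract can be sketched in a few lines: among Best-of-N candidates that pass the deterministic evaluator, pick the one the model was *least* confident about (highest mean per-token surprisal). The sample fields and verifier interface below are illustrative assumptions, not the paper's actual implementation.

```python
def mean_surprisal(token_logprobs):
    """Average negative log-probability per token (higher = less confident)."""
    return -sum(token_logprobs) / len(token_logprobs)

def surprisal_guided_select(samples, top_k=1):
    """Select the highest-surprisal verified sample(s) from a Best-of-N batch.

    Each sample is a dict with assumed fields:
      - "code": the generated candidate program
      - "token_logprobs": per-token log-probs reported by the sampler
      - "passed": bool result from the deterministic executor/verifier
    """
    correct = [s for s in samples if s["passed"]]
    # Rank verified candidates by surprisal, most surprising first.
    ranked = sorted(correct,
                    key=lambda s: mean_surprisal(s["token_logprobs"]),
                    reverse=True)
    return ranked[:top_k]

# Toy batch: two verified candidates with different confidence, one failure.
samples = [
    {"code": "kernel_a", "token_logprobs": [-0.1, -0.2], "passed": True},
    {"code": "kernel_b", "token_logprobs": [-1.5, -2.0], "passed": True},
    {"code": "kernel_c", "token_logprobs": [-0.05, -0.05], "passed": False},
]
best = surprisal_guided_select(samples)[0]
print(best["code"])  # → kernel_b (mean surprisal 1.75 vs 0.15 for kernel_a)
```

Note the inversion of the usual heuristic: confidence-based reranking would pick `kernel_a`, but under the paper's finding that optimal solutions occupy the distribution tail, the low-confidence verified sample is preferred. The "top3" variant from the abstract corresponds to `top_k=3`.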
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Learning Structural Hardness for Combinatorial Auctions: Instance-Dependent Algorithm Selection via Graph Neural Networks [0.0]
The Winner Determination Problem in auctions is NP-hard. No existing method reliably predicts which instances will defeat fast greedy heuristics. Recent evidence shows that graph neural networks (GNNs) rarely outperform well-tuned classical methods.
arXiv Detail & Related papers (2026-02-16T14:26:25Z) - Optimal Stopping vs Best-of-$N$ for Inference Time Optimization [11.334978981105559]
We introduce a new framework for inference-time optimization based on the classical Pandora's Box problem. We develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our results establish a principled bridge between optimal stopping theory and inference-time scaling.
arXiv Detail & Related papers (2025-10-01T19:25:59Z) - C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing [21.119495676190127]
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways. Naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. We develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample.
arXiv Detail & Related papers (2025-04-10T17:59:56Z) - MetaScale: Test-Time Scaling with Evolving Meta-Thoughts [51.35594569020857]
Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches. MetaScale scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
arXiv Detail & Related papers (2025-03-17T17:59:54Z) - Influential Language Data Selection via Gradient Trajectory Pursuit [9.925547848971034]
Gradient Trajectory Pursuit (GTP) is an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective.
In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks.
arXiv Detail & Related papers (2024-10-22T05:32:40Z) - Gaussian Process Thompson Sampling via Rootfinding [2.94944680995069]
Thompson sampling (TS) is a simple, effective policy in Bayesian decision making.
In continuous optimization, the posterior of the objective function is often a Gaussian process (GP), whose sample paths have numerous local optima.
We introduce an efficient global optimization strategy for GP-TS that carefully selects starting points for gradient-based multi-starts.
arXiv Detail & Related papers (2024-10-10T16:06:45Z) - Iterative Reasoning Preference Optimization [84.15992372132507]
We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
arXiv Detail & Related papers (2024-04-30T17:28:05Z) - AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z) - Selective Focusing Learning in Conditional GANs [13.264508791149987]
Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their class-wise controllability and superior quality for complex generation tasks.
This paper proposes a simple but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly.
arXiv Detail & Related papers (2021-07-08T06:06:56Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.