Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
- URL: http://arxiv.org/abs/2602.07670v1
- Date: Sat, 07 Feb 2026 19:29:07 GMT
- Title: Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation
- Authors: Jarrod Barnes
- Abstract summary: We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided top-3 selection matches oracle performance at 100%; this zero-cost strategy holds up under length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
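The core idea in the abstract can be sketched in a few lines: among Best-of-N candidates that pass the deterministic evaluator, pick the one the model was *least* confident about (highest mean per-token surprisal). The sample fields and verifier interface below are illustrative assumptions, not the paper's actual implementation.

```python
def mean_surprisal(token_logprobs):
    """Average negative log-probability per token (higher = less confident)."""
    return -sum(token_logprobs) / len(token_logprobs)

def surprisal_guided_select(samples, top_k=1):
    """Select the highest-surprisal verified sample(s) from a Best-of-N batch.

    Each sample is a dict with assumed fields:
      - "code": the generated candidate program
      - "token_logprobs": per-token log-probs reported by the sampler
      - "passed": bool result from the deterministic executor/verifier
    """
    correct = [s for s in samples if s["passed"]]
    # Rank verified candidates by surprisal, most surprising first.
    ranked = sorted(correct,
                    key=lambda s: mean_surprisal(s["token_logprobs"]),
                    reverse=True)
    return ranked[:top_k]

# Toy batch: two verified candidates with different confidence, one failure.
samples = [
    {"code": "kernel_a", "token_logprobs": [-0.1, -0.2], "passed": True},
    {"code": "kernel_b", "token_logprobs": [-1.5, -2.0], "passed": True},
    {"code": "kernel_c", "token_logprobs": [-0.05, -0.05], "passed": False},
]
best = surprisal_guided_select(samples)[0]
print(best["code"])  # → kernel_b (mean surprisal 1.75 vs 0.15 for kernel_a)
```

Note the inversion of the usual heuristic: confidence-based reranking would pick `kernel_a`, but under the paper's finding that optimal solutions occupy the distribution tail, the low-confidence verified sample is preferred. The "top3" variant from the abstract corresponds to `top_k=3`.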
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Learning Structural Hardness for Combinatorial Auctions: Instance-Dependent Algorithm Selection via Graph Neural Networks [0.0]
The Winner Determination Problem in auctions is NP-hard. No existing method reliably predicts which instances will defeat fast greedy heuristics. Recent evidence shows that graph neural networks (GNNs) rarely outperform well-tuned classical methods.
arXiv Detail & Related papers (2026-02-16T14:26:25Z) - Optimal Stopping vs Best-of-$N$ for Inference Time Optimization [11.334978981105559]
We introduce a new framework for inference-time optimization based on the classical Pandora's Box problem. We develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our results establish a principled bridge between optimal stopping theory and inference-time scaling.
arXiv Detail & Related papers (2025-10-01T19:25:59Z) - C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing [21.119495676190127]
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways. Naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. We develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample.
arXiv Detail & Related papers (2025-04-10T17:59:56Z) - MetaScale: Test-Time Scaling with Evolving Meta-Thoughts [51.35594569020857]
Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches. MetaScale scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
arXiv Detail & Related papers (2025-03-17T17:59:54Z) - Influential Language Data Selection via Gradient Trajectory Pursuit [9.925547848971034]
Gradient Trajectory Pursuit (GTP) is an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective.
In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks.
arXiv Detail & Related papers (2024-10-22T05:32:40Z) - Gaussian Process Thompson Sampling via Rootfinding [2.94944680995069]
Thompson sampling (TS) is a simple, effective policy in Bayesian decision making.
In continuous optimization, the posterior of the objective function is often a Gaussian process (GP), whose sample paths have numerous local optima.
We introduce an efficient global optimization strategy for GP-TS that carefully selects starting points for gradient-based multi-starts.
arXiv Detail & Related papers (2024-10-10T16:06:45Z) - Iterative Reasoning Preference Optimization [84.15992372132507]
We develop an iterative approach to optimize the preference between generated Chain-of-Thought (CoT) candidates.
We show reasoning improves across repeated iterations of this scheme.
For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
arXiv Detail & Related papers (2024-04-30T17:28:05Z) - AdaNPC: Exploring Non-Parametric Classifier for Test-Time Adaptation [64.9230895853942]
Domain generalization can be arbitrarily hard without exploiting target domain information.
Test-time adaptive (TTA) methods are proposed to address this issue.
In this work, we adopt a non-parametric classifier to perform test-time adaptation (AdaNPC).
arXiv Detail & Related papers (2023-04-25T04:23:13Z) - Selective Focusing Learning in Conditional GANs [13.264508791149987]
Conditional generative adversarial networks (cGANs) have demonstrated remarkable success due to their class-wise controllability and superior quality for complex generation tasks.
This paper proposes a simple but effective training methodology, selective focusing learning, which enforces the discriminator and generator to learn easy samples of each class rapidly.
arXiv Detail & Related papers (2021-07-08T06:06:56Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.