Learning to Discover at Test Time
- URL: http://arxiv.org/abs/2601.16175v1
- Date: Thu, 22 Jan 2026 18:24:00 GMT
- Title: Learning to Discover at Test Time
- Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
- Abstract summary: We use AI to discover a new state of the art for a scientific problem. We call this method Test-Time Training to Discover (TTT-Discover). We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology.
- Score: 79.84622971773862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
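The loop the abstract describes (sample candidate solutions, score them with the problem's continuous reward, and keep adapting toward the single most promising one) can be illustrated with a toy sketch. This is not the paper's implementation: the hill-climbing update below is a hypothetical stand-in for the test-time RL step, and the one-dimensional reward is a placeholder problem.

```python
import random

def ttt_discover(reward, init, n_iters=200, pool_size=8, step=0.1, seed=0):
    # Toy stand-in for test-time training: sample a pool of candidate
    # solutions around the current "policy", score each with the
    # problem-specific reward, shift the policy toward the single most
    # promising candidate, and track the best solution ever seen.
    rng = random.Random(seed)
    center = init
    best_x, best_r = init, reward(init)
    for _ in range(n_iters):
        pool = [center + rng.gauss(0.0, step) for _ in range(pool_size)]
        r, x = max((reward(c), c) for c in pool)
        if r > best_r:
            best_r, best_x = r, x
        center += 0.5 * (x - center)  # "train" on this one test problem only
    return best_x, best_r

# toy continuous-reward problem: maximize -(x - 3)^2
best_x, best_r = ttt_discover(lambda x: -(x - 3.0) ** 2, init=0.0)
```

The key design point the abstract emphasizes survives even in this caricature: the objective is the single best solution found (`best_x`), not the average quality of the pool.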
Related papers
- AutoCode: LLMs as Problem Setters for Competitive Programming [94.71566758494787]
We introduce AutoCode, which uses multiple rounds of validation to yield competition-grade problem statements and test cases. On held-out problems, AutoCode test suites approach 99% consistency with official judgments.
arXiv Detail & Related papers (2025-09-29T17:59:03Z)
- Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems? [0.34376560669160394]
This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. We develop and evaluate a variety of LM inference techniques for competitive programming with these resources. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions.
arXiv Detail & Related papers (2025-08-30T23:02:12Z)
- Counting Cycles with Deepseek [10.137124603866038]
We consider a difficult open problem: how to derive a Computationally Efficient Equivalent Form (CEEF) for the cycle count statistic. We combine a novel approach we propose with the powerful coding skills of AI to solve the problem. We find that, while AI is unable to solve the problem all by itself, it is able to solve it if we provide a clear strategy, step-by-step guidance, and carefully written prompts.
arXiv Detail & Related papers (2025-05-23T14:34:40Z)
- Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems [62.76627483915117]
Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. We introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad.
We construct and test a range of LM inference methods for competitive programming for the first time.
We find GPT-4 only achieves an 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
- Streaming Algorithms for Learning with Experts: Deterministic Versus Robust [62.98860182111096]
In the online learning with experts problem, an algorithm must make a prediction about an outcome on each of $T$ days (or times).
The goal is to make predictions with the minimum cost, specifically compared to the best expert in the set.
We show a space lower bound of $\widetilde{\Omega}\left(\frac{nM}{RT}\right)$ for any deterministic algorithm that achieves regret $R$ when the best expert makes $M$ mistakes.
arXiv Detail & Related papers (2023-03-03T04:39:53Z)
- Lifelong Bandit Optimization: No Prior and No Regret [70.94238868711952]
We develop LIBO, an algorithm which adapts to the environment by learning from past experience.
We assume a kernelized structure where the kernel is unknown but shared across all tasks.
Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees optimal performance.
arXiv Detail & Related papers (2022-10-27T14:48:49Z)
- Winning solutions and post-challenge analyses of the ChaLearn AutoDL challenge 2019 [112.36155380260655]
This paper reports the results and post-challenge analyses of ChaLearn's AutoDL challenge series.
Results show that DL methods dominated, though popular Neural Architecture Search (NAS) was impractical.
A high-level modular organization emerged, featuring a "meta-learner", "data ingestor", "model selector", "model/learner", and "evaluator".
arXiv Detail & Related papers (2022-01-11T06:21:18Z)
- A New Constructive Heuristic driven by Machine Learning for the Traveling Salesman Problem [8.604882842499212]
Recent systems applying Machine Learning (ML) to solve the Traveling Salesman Problem (TSP) exhibit issues when they try to scale up to real-world scenarios.
The use of Candidate Lists (CLs) has been proposed to cope with these issues.
In this work we instead use a machine learning model to confirm an edge's addition to the solution only for highly probable edges.
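A minimal sketch of that idea, with a hypothetical `edge_prob` function standing in for the learned model (this is an illustration of candidate-list filtering, not the paper's heuristic):

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def guided_tour(points, edge_prob, threshold=0.5, k=3):
    # Nearest-neighbor construction with an ML-style filter: among the
    # k nearest unvisited cities (the candidate list), take the closest
    # one whose edge the model rates above `threshold`; otherwise fall
    # back to the plain nearest neighbor so the tour always completes.
    tour = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        cur = tour[-1]
        cand = sorted(unvisited, key=lambda j: dist(points[cur], points[j]))[:k]
        confirmed = [j for j in cand if edge_prob(cur, j) >= threshold]
        nxt = confirmed[0] if confirmed else cand[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# toy "model" that trusts every edge; 4 corners of the unit square
tour = guided_tour([(0, 0), (0, 1), (1, 1), (1, 0)], lambda i, j: 1.0)
```

The candidate list keeps each step's model queries to $k$ edges rather than all $O(n)$ remaining cities, which is the scaling concern the summary raises.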
arXiv Detail & Related papers (2021-08-17T21:37:23Z)
- Learning from Survey Propagation: a Neural Network for MAX-E-$3$-SAT [0.0]
This paper presents a new algorithm for computing approximate solutions in $\Theta(N)$ for the Maximum Exact 3-Satisfiability (MAX-E-$3$-SAT) problem.
arXiv Detail & Related papers (2020-12-10T07:59:54Z)
- Physarum Powered Differentiable Linear Programming Layers and Applications [48.77235931652611]
We propose an efficient and differentiable solver for general linear programming problems.
We show the use of our solver in a video segmentation task and meta-learning for few-shot learning.
arXiv Detail & Related papers (2020-04-30T01:50:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.