Scaling Unverifiable Rewards: A Case Study on Visual Insights
- URL: http://arxiv.org/abs/2512.22650v1
- Date: Sat, 27 Dec 2025 17:01:38 GMT
- Title: Scaling Unverifiable Rewards: A Case Study on Visual Insights
- Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang
- Abstract summary: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS). We propose Selective TTS, a process-based refinement framework that scales inference across different stages of a multi-agent pipeline. Selective TTS improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance.
- Score: 29.54766251030519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating errors across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than repeatedly refining a single output over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent system that generates visually insightful charts and reports for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall's τ=0.55). Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
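The abstract describes the core loop of Selective TTS: at each pipeline stage, expand the surviving branches into several candidates, score them with a stage-specific judge, and prune to the best few before the next stage. The sketch below illustrates that control flow only; the stage names and the `generate`/`judge` functions are hypothetical stand-ins (toy random scorers), not the paper's actual agents or judge model.

```python
# Minimal sketch of stage-wise selective refinement, inspired by the
# abstract's description. Stage names, generate(), and judge() are
# illustrative stand-ins, not the paper's implementation.
import random

STAGES = ["transform", "visualize", "report"]  # hypothetical pipeline stages

def generate(stage: str, tag: str) -> dict:
    """Stand-in for an LLM agent producing one candidate at a stage."""
    rng = random.Random(f"{stage}:{tag}")  # deterministic toy scorer
    return {"stage": stage, "score": rng.random()}

def judge(stage: str, candidate: dict) -> float:
    """Stand-in for a process-specific judge scoring a candidate."""
    return candidate["score"]

def selective_tts(branch_factor: int = 3, keep: int = 1) -> dict:
    """Expand each surviving branch at every stage, score candidates with
    that stage's judge, and prune to the top `keep` before moving on."""
    frontier = [{"stage": "init", "score": 0.0}]  # start from the raw dataset
    for stage in STAGES:
        candidates = [
            generate(stage, f"{i}-{s}")
            for i in range(len(frontier))
            for s in range(branch_factor)
        ]
        candidates.sort(key=lambda c: judge(stage, c), reverse=True)
        frontier = candidates[:keep]  # prune low-quality branches early
    return frontier[0]

best = selective_tts()
print(best["stage"])
```

The contrast with prior TTS is in where the budget goes: rather than re-judging and re-refining one final output many times (where a drifting judge compounds error), compute is spread across stages and weak branches are cut before their errors propagate.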
Related papers
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z)
- Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning [32.295907409325615]
Training large language models to reason with search engines via reinforcement learning is hindered by a credit assignment problem. We propose SLATE, a framework built on two complementary ideas. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines.
arXiv Detail & Related papers (2026-02-26T19:05:40Z)
- SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents [12.355536750226555]
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering tasks. We introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%.
arXiv Detail & Related papers (2026-01-29T18:50:29Z)
- FR-TTS: Test-Time Scaling for NTP-based Image Generation with Effective Filling-based Reward Signal [26.72622200307507]
Test-time scaling (TTS) has become a prevalent technique in image generation, significantly boosting output quality. But applying this powerful methodology to the next-token prediction paradigm remains challenging. We introduce the Filling-Based Reward (FR) to estimate the approximate future trajectory of an intermediate sample. We experimentally validate the superiority of FR-TTS over multiple established benchmarks and various reward models.
arXiv Detail & Related papers (2025-11-29T10:34:16Z)
- BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT.
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
- Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z)
- Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z)
- TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation [48.61855865678161]
We present the first general test-time scaling framework for visual auto-regressive (VAR) models. We propose clustering-based diversity search and resampling-based potential selection. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement.
arXiv Detail & Related papers (2025-07-24T16:04:55Z)
- Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [42.608899417822656]
We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations. We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis [24.04852523970509]
We propose a novel BTR model that incorporates a nonparametric Multiplicative Gamma Process (MGP) prior. To handle discrete data, we introduce the Pólya-Gamma augmentation for closed-form updates. We develop an efficient Gibbs sampler for consistent posterior simulation, which reduces the computational complexity of the previous VI algorithm by two orders of magnitude.
arXiv Detail & Related papers (2024-12-04T13:55:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.