Scaling Unverifiable Rewards: A Case Study on Visual Insights
- URL: http://arxiv.org/abs/2512.22650v1
- Date: Sat, 27 Dec 2025 17:01:38 GMT
- Title: Scaling Unverifiable Rewards: A Case Study on Visual Insights
- Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang
- Abstract summary: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS). We propose Selective TTS, a process-based refinement framework that scales inference across different stages of a multi-agent pipeline. Selective TTS improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance.
- Score: 29.54766251030519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating errors across stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than repeatedly refining a single output over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent system that generates visually insightful charts and reports for a given dataset, and design a reliable LLM-based judge model aligned with human experts (Kendall's τ=0.55). Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
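The abstract describes the core loop of Selective TTS: at each pipeline stage, expand the surviving branches into several candidates, score them with a stage-specific judge, and prune to the best few before the next stage. The sketch below illustrates that control flow only; the stage names and the `generate`/`judge` functions are hypothetical stand-ins (toy random scorers), not the paper's actual agents or judge model.

```python
# Minimal sketch of stage-wise selective refinement, inspired by the
# abstract's description. Stage names, generate(), and judge() are
# illustrative stand-ins, not the paper's implementation.
import random

STAGES = ["transform", "visualize", "report"]  # hypothetical pipeline stages

def generate(stage: str, tag: str) -> dict:
    """Stand-in for an LLM agent producing one candidate at a stage."""
    rng = random.Random(f"{stage}:{tag}")  # deterministic toy scorer
    return {"stage": stage, "score": rng.random()}

def judge(stage: str, candidate: dict) -> float:
    """Stand-in for a process-specific judge scoring a candidate."""
    return candidate["score"]

def selective_tts(branch_factor: int = 3, keep: int = 1) -> dict:
    """Expand each surviving branch at every stage, score candidates with
    that stage's judge, and prune to the top `keep` before moving on."""
    frontier = [{"stage": "init", "score": 0.0}]  # start from the raw dataset
    for stage in STAGES:
        candidates = [
            generate(stage, f"{i}-{s}")
            for i in range(len(frontier))
            for s in range(branch_factor)
        ]
        candidates.sort(key=lambda c: judge(stage, c), reverse=True)
        frontier = candidates[:keep]  # prune low-quality branches early
    return frontier[0]

best = selective_tts()
print(best["stage"])
```

The contrast with prior TTS is in where the budget goes: rather than re-judging and re-refining one final output many times (where a drifting judge compounds error), compute is spread across stages and weak branches are cut before their errors propagate.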
Related papers
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z)
- Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning [32.295907409325615]
Training large language models to reason with search engines via reinforcement learning is hindered by a credit assignment problem. We propose SLATE, a framework built on two complementary ideas. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines.
arXiv Detail & Related papers (2026-02-26T19:05:40Z)
- SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents [12.355536750226555]
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering tasks. We introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%.
arXiv Detail & Related papers (2026-01-29T18:50:29Z)
- FR-TTS: Test-Time Scaling for NTP-based Image Generation with Effective Filling-based Reward Signal [26.72622200307507]
Test-time scaling (TTS) has become a prevalent technique in image generation, significantly boosting output quality. But applying this powerful methodology to the next-token prediction paradigm remains challenging. We introduce the Filling-Based Reward (FR) to estimate the approximate future trajectory of an intermediate sample. We experimentally validate the superiority of FR-TTS over multiple established benchmarks and various reward models.
arXiv Detail & Related papers (2025-11-29T10:34:16Z)
- BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in RFT.
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
- Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z)
- Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction. We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z)
- TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation [48.61855865678161]
We present the first general test-time scaling framework for visual auto-regressive (VAR) models. We propose clustering-based diversity search and resampling-based potential selection. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement.
arXiv Detail & Related papers (2025-07-24T16:04:55Z)
- Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [42.608899417822656]
We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations. We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
- Scalable Bayesian Tensor Ring Factorization for Multiway Data Analysis [24.04852523970509]
We propose a novel BTR model that incorporates a nonparametric Multiplicative Gamma Process (MGP) prior. To handle discrete data, we introduce the Pólya-Gamma augmentation for closed-form updates. We develop an efficient Gibbs sampler for consistent posterior simulation, which reduces the computational complexity of the previous VI algorithm by two orders of magnitude.
arXiv Detail & Related papers (2024-12-04T13:55:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.