Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
- URL: http://arxiv.org/abs/2510.06135v1
- Date: Tue, 07 Oct 2025 17:09:23 GMT
- Title: Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
- Authors: Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, Junxian He,
- Abstract summary: In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. We study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. We conduct experiments with flagship open-source models and extend them to their "Heavy" variants through TTS.
- Score: 40.75612723453356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as *asymmetric verification*, highlights the strong potential of test-time scaling (TTS). In this work, we study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but soon degrade performance. Leveraging asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models and extend them to their "Heavy" variants through TTS. These deep research agents achieve gains of up to 27 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches accuracy of **54.0%** on BrowseComp and **66.0%** on GAIA, placing it comparable to the best proprietary choices such as OpenAI Deep Research. Tongyi-DeepResearch Heavy further achieves **69.0%** accuracy on BrowseComp, greatly surpassing the best proprietary results.
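The asymmetric-verification idea in the abstract can be illustrated with the paper's own Sudoku analogy: producing a solved grid is hard, but checking one is a few cheap set operations. The sketch below is illustrative only; the function names and the best-of-n selection loop are assumptions, not the paper's implementation.

```python
# Minimal sketch of parallel test-time scaling with asymmetric verification.
# Sudoku stands in for deep-search tasks: generating a candidate is expensive,
# while verifying one is cheap, so a small verifier budget can select well
# among many samples. All names here are hypothetical.

def is_valid_sudoku(grid):
    """Cheap verifier: rows, columns, and 3x3 boxes of a 9x9 grid must hold
    no duplicate nonzero digits (zeros mark empty cells)."""
    def ok(unit):
        vals = [v for v in unit if v != 0]
        return len(vals) == len(set(vals))
    rows = [list(r) for r in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[r][c]
              for r in range(br, br + 3) for c in range(bc, bc + 3)]
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return all(ok(u) for u in rows + cols + boxes)

def best_of_n(generate, verify, n):
    """Parallel scaling: sample n candidates, return one the verifier accepts."""
    candidates = [generate() for _ in range(n)]
    for cand in candidates:
        if verify(cand):
            return cand
    return candidates[0]  # fall back if nothing verifies
```

In a deep-search agent, `generate` would be a full agent rollout and `verify` a lightweight judge over the final answer; the asymmetry means `verify` costs a small fraction of `generate`.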
Related papers
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction [112.54016379556073]
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework. We show that ATTS delivers up to a 56.7x speedup in test-time scaling and a 4.14x throughput improvement.
arXiv Detail & Related papers (2025-09-18T16:55:09Z) - Trust but Verify! A Survey on Verification Design for Test-time Scaling [8.428618801719198]
Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. Verifiers serve as reward models that help score the candidate outputs from the decoding process. Verifiers can be prompt-based, or fine-tuned as discriminative or generative models.
arXiv Detail & Related papers (2025-08-20T22:27:21Z) - $\texttt{SPECS}$: Faster Test-Time Scaling through Speculative Drafts [55.231201692232894]
SPECS is a latency-aware test-time scaling method inspired by speculative decoding. Our results show that SPECS matches or surpasses beam search accuracy while reducing latency by up to ~19.1%.
arXiv Detail & Related papers (2025-06-15T05:50:05Z) - First Finish Search: Efficient Test-Time Scaling in Large Language Models [20.62274005080048]
First Finish Search (FFS) is a training-free parallel decoding strategy that launches n independent samples and returns as soon as any one completes. FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over DeepSeek-R1's standalone accuracy, nearly matching OpenAI's o4-mini performance.
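The FFS abstract describes a concrete mechanism: launch n independent decoding runs and return whichever finishes first. A minimal sketch using a thread pool, assuming `sample_fn` stands in for a model's decoding call (the threading mechanics are illustrative, not the paper's implementation):

```python
import concurrent.futures

def first_finish_search(sample_fn, n):
    """Run n independent samples in parallel; return the first to complete."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(sample_fn) for _ in range(n)]
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort: already-running samples are not interrupted
        return next(iter(done)).result()
```

The intuition FFS exploits is that shorter reasoning traces are often also correct ones, so returning the earliest finisher trades little accuracy for a large latency reduction.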
arXiv Detail & Related papers (2025-05-23T17:57:43Z) - Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling [4.745268750215421]
Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. We introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g.
arXiv Detail & Related papers (2025-05-16T22:24:48Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3B/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z) - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities. We find that longer CoTs of these o1-like models do not consistently enhance accuracy. We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z) - SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling [39.57154199908565]
Self-Enhanced Test-Time Scaling (SETS) is a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of Large Language Models, unifying sampling, verification, and correction within a single framework. Our results show SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
arXiv Detail & Related papers (2025-01-31T17:03:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.