Fugu-MT Paper Translation (Summary): Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification

Paper summary: Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification

arxiv url: http://arxiv.org/abs/2510.06135v1
Date: Tue, 07 Oct 2025 17:09:23 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.373945
Title: Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
Title (reference translation): Pushing the Test-Time Scaling Limits of Deep Search through Asymmetric Verification
Authors: Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, Junxian He
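Abstract summary: In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. The authors study both sequential and parallel test-time scaling (TTS) of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. They conduct experiments with flagship open-source models and extend them to "Heavy" variants through TTS.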
Reference score (attention level, computed independently by this site): 40.75612723453356
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as asymmetric verification, highlights the strong potential of test-time scaling (TTS). In this work, we study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but soon degrade performance. Leveraging asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models and extend them to their "Heavy" variants through TTS. These deep research agents achieve gains of up to 27 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches accuracy of 54.0% on BrowseComp and 66.0% on GAIA, placing it comparable to the best proprietary choices such as OpenAI Deep Research. Tongyi-DeepResearch Heavy further achieves 69.0% accuracy on BrowseComp, greatly surpassing the best proprietary results.
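Abstract (reference translation): Test-time compute can be scaled both sequentially and in parallel. Sequential scaling lengthens the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has produced the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses is far easier than generating them. This property, called asymmetric verification, underscores the strong potential of test-time scaling (TTS). This work studies both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. The experiments first show that sequential scaling methods such as budget forcing are effective initially but soon degrade performance. By exploiting asymmetric verification, however, substantial improvements can be achieved while allocating only a modest amount of compute to the verifier. The experiments use flagship open-source models and extend them to "Heavy" variants through TTS. These deep research agents gain up to 27 absolute points on benchmarks such as BrowseComp. Notably, as an open-source alternative, GLM-4.5 Heavy reaches 54.0% accuracy on BrowseComp and 66.0% on GAIA, making it comparable to the best proprietary options such as OpenAI Deep Research. Tongyi-DeepResearch Heavy further achieves 69.0% accuracy on BrowseComp, far surpassing the best proprietary results.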
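
Illustrative sketch (not from the paper): the abstract's Sudoku example can be made concrete with a toy verifier. Checking a completed grid takes a few set comparisons, whereas producing one requires search, and that gap is what parallel scaling exploits when it selects among candidates. The abstract does not describe the authors' verifier or selection rule, so the names `is_valid_sudoku_solution` and `select_verified`, and the first-match selection policy, are assumptions made only for illustration.

```python
from typing import Callable, Iterable, List, Optional

Grid = List[List[int]]


def is_valid_sudoku_solution(grid: Grid) -> bool:
    """Cheap verifier: True if a completed 9x9 grid satisfies every Sudoku constraint."""
    digits = set(range(1, 10))
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    rows_ok = all(set(row) == digits for row in grid)
    cols_ok = all({grid[r][c] for r in range(9)} == digits for c in range(9))
    boxes_ok = all(
        {grid[br + r][bc + c] for r in range(3) for c in range(3)} == digits
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok


def select_verified(
    candidates: Iterable[Grid], verifier: Callable[[Grid], bool]
) -> Optional[Grid]:
    """Hypothetical selection rule: return the first candidate the verifier accepts."""
    for candidate in candidates:
        if verifier(candidate):
            return candidate
    return None


# A known-valid grid built from a standard shifting pattern, plus a corrupted copy
# standing in for an incorrect candidate produced by a generator.
good = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
bad = [row[:] for row in good]
bad[0][0] = bad[0][1]  # duplicate digit in the first row

chosen = select_verified([bad, good], is_valid_sudoku_solution)
```

Here `chosen` is the valid grid: the verifier rejects the corrupted candidate without ever solving a puzzle itself. A verifier for deep search answers would be far less clear-cut than this exact check, so the sketch conveys only the general shape of verify-then-select.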

Related papers