Fugu-MT Paper Translation (Abstract): OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Paper overview: OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

arxiv url: http://arxiv.org/abs/2603.15594v1
Date: Mon, 16 Mar 2026 17:52:04 GMT
Status: translation complete
Last updated in system: 2026-03-17 18:28:58.713382
Title: OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
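Abstract summary: OpenSeeker is the first fully open-source search agent (model and data) to achieve frontier-level performance. OpenSeeker achieves state-of-the-art performance across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
Reference score (independently computed attention score): 43.955144962666196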
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the broader research community's ability to develop and innovate in this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) fact-grounded, scalable, controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby helping the teacher LLMs generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single training run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks, including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent, DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.
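The abstract names topological expansion and entity obfuscation but gives no implementation details, so the following is only a minimal sketch of the general idea and not the authors' pipeline. It assumes a hypothetical toy graph (`TOY_GRAPH`) whose edges carry a description template for the linked page; every entity name, template, and function name here is invented for illustration. Expansion is a plain breadth-first walk bounded by `max_hops`, and obfuscation replaces each intermediate entity with a nested description so that only the seed is named in the question.

```python
from collections import deque

# Hypothetical toy "web graph": page -> list of (template, linked page).
# The template describes the linked page in terms of the source page.
# All entities below are fictional and exist only for this sketch.
TOY_GRAPH = {
    "Orla Venn": [("the company founded by {}", "Venn Optics")],
    "Venn Optics": [("the city where {} is headquartered", "Port Sable")],
    "Port Sable": [("the river that flows through {}", "Sable River")],
    "Sable River": [],
}


def expand(graph, seed, max_hops):
    """Breadth-first expansion: collect edges within max_hops of the seed."""
    edges, seen, queue = [], {seed}, deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for template, target in graph.get(node, []):
            edges.append((node, template, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


def chain_from(edges, seed):
    """Follow the first outgoing edge of each node to obtain a linear path."""
    first_edge = {}
    for source, template, target in edges:
        first_edge.setdefault(source, (template, target))
    path, node, visited = [], seed, {seed}
    while node in first_edge:
        template, target = first_edge[node]
        if target in visited:
            break  # avoid cycles
        path.append((node, template, target))
        visited.add(target)
        node = target
    return path


def synthesize_question(path):
    """Nest descriptions so intermediate entities are never named."""
    if not path:
        raise ValueError("path must contain at least one edge")
    description = path[0][0]  # only the seed entity appears by name
    for _, template, _ in path:
        description = template.format(description)
    answer = path[-1][2]
    return f"What is the name of {description}?", answer


if __name__ == "__main__":
    for hops in (2, 3):
        edges = expand(TOY_GRAPH, "Orla Venn", hops)
        question, answer = synthesize_question(chain_from(edges, "Orla Venn"))
        print(hops, question, "->", answer)
```

In this sketch the hop budget is the only complexity control: a larger `max_hops` yields a longer chain and a more deeply nested question, while the answer stays grounded in an edge that actually exists in the graph.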
