Fugu-MT Paper Translation (Summary): BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Paper summary: BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

arxiv url: http://arxiv.org/abs/2605.06177v1
Date: Thu, 07 May 2026 12:57:18 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.800919
Title: BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
Title (reference translation): BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
Authors: Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton
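Abstract summary: The authors release BioMedArena, an open-source toolkit for evaluating deep research agents. BioMedArena decouples six layers of biomedical agent evaluation. It exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families.
Reference score (attention score computed independently by this site): 35.04967801827194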
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
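Abstract (reference translation): The same backbone evaluated on the same benchmark can report different accuracies in different papers because the harness and tool registry differ. We call this the per-paper engineering tax and release BioMedArena as an open-source toolkit. BioMedArena decouples six layers (benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring) and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which give 12 backbones competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena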
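
The abstract says that adding a new model "reduces to registering a few-line provider adapter" and that scoring is a layer separate from the model. The sketch below shows one common way such a registry can be structured. It is not taken from the BioMedArena codebase: `ProviderRegistry`, `echo_model`, and `exact_match_accuracy` are hypothetical names invented for this example, and exact-match scoring is only an assumed stand-in for the toolkit's scoring layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical illustration only: these names do not come from BioMedArena.
ModelFn = Callable[[str], str]


@dataclass
class ProviderRegistry:
    """Maps a provider name to a callable that answers a prompt."""

    _providers: Dict[str, ModelFn] = field(default_factory=dict)

    def register(self, name: str) -> Callable[[ModelFn], ModelFn]:
        """Decorator that stores the adapter under `name`."""

        def decorator(fn: ModelFn) -> ModelFn:
            if name in self._providers:
                raise ValueError(f"provider already registered: {name}")
            self._providers[name] = fn
            return fn

        return decorator

    def get(self, name: str) -> ModelFn:
        return self._providers[name]


registry = ProviderRegistry()


@registry.register("echo-model")
def echo_model(prompt: str) -> str:
    # Stand-in backbone: a real adapter would call a model API here.
    return prompt.strip().lower()


def exact_match_accuracy(provider: str, tasks: List[Tuple[str, str]]) -> float:
    """Assumed scoring layer: fraction of answers equal to the reference."""
    model = registry.get(provider)
    correct = sum(model(question) == reference for question, reference in tasks)
    return correct / len(tasks)
```

Because the scorer looks up the model by name, the same task list and scoring function can be reused for any registered provider. That reuse is the kind of comparability the abstract argues for.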

