Fugu-MT Paper Translation (Abstract): OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Paper overview: OpenBioRQ: Unsolved Biomedical Research Questions for Agents

arxiv url: http://arxiv.org/abs/2606.21959v1
Date: Sat, 20 Jun 2026 09:16:15 GMT
Status: Translation complete
Last updated in system: 2026-06-25 23:32:47.634451
Title: OpenBioRQ: Unsolved Biomedical Research Questions for Agents
Title (reference translation): OpenBioRQ: Unsolved Biomedical Research Questions for Agents
Authors: Minbyul Jeong
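Abstract summary: This paper introduces OpenBioRQ, a benchmark of unsolved biomedical research questions. To the author's knowledge, it is the first biomedical benchmark to combine an agentic setting, in which the model must issue multiple tool calls, with unsolved questions that have no answer key. Agentic collapse is observed on the hardest questions, where agents stop using their tools.
Reference score (independently computed attention measure): 2.3677513428867614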
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A working citation looks like proof -- but the fact that a link resolves does not mean the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting -- where the model must issue multiple tool calls -- with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score -- so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.
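Abstract (reference translation): A working citation looks like proof, but the fact that a link resolves does not mean the cited paper supports the claim. Current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. When a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. This benchmark consists of unsolved biomedical research questions across 12 domains and treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, where the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is anchored on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only about 17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a 29-60% range. The benchmark is thus hard and non-saturating (the best agent still leaves about 33-40% unsolved). Beyond difficulty, agentic collapse is observed on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access barely changes its score. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82.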

Related papers