Fugu-MT Paper Translation (Abstract): LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Paper overview: LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

arxiv url: http://arxiv.org/abs/2604.09554v1
Date: Wed, 04 Feb 2026 18:50:48 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.488785
Title: LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
Title (reference translation): LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
Authors: Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques,
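Abstract summary: We introduce an evolution of a benchmark for measuring the real-world capabilities of AI systems performing scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench. We evaluate the performance of current frontier models and show that, while the abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty.
Reference score (independently computed attention score): 33.86687294196014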
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not only accelerate, but increasingly shift focus to more real-world capabilities: beyond rote knowledge, and even beyond reasoning alone, to actually measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities but in more realistic contexts. We evaluate performance of current frontier models, and show that while abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities, and we hope that it continues to help advance development of AI tools for these core research functions. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.
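Abstract (reference translation): Optimism about accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training specialized foundation models on scientific data, to agentic autonomous hypothesis generation systems, to AI-driven autonomous labs. The need to measure the progress of AI systems in scientific domains must not only accelerate but also shift its focus toward more real-world capabilities, moving beyond rote knowledge and reasoning alone to measuring the ability to actually perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as a first attempt at measuring these abilities. Here we introduce LABBench2, an evolution of that benchmark, for measuring the real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a continuation of LAB-Bench, measuring similar capabilities in more realistic contexts. We evaluate the performance of current frontier models and show that, while the abilities measured by LAB-Bench and LABBench2 have improved substantially, LABBench2 provides a meaningful jump in difficulty (model-specific accuracy differences range from -26% to -46% across subtasks) and underscores the continued room for performance improvement. LABBench2 continues the legacy of LAB-Bench as a de facto benchmark for AI scientific research capabilities. To facilitate community use and development, we provide the task dataset at https://huggingface.co/datasets/futurehouse/labbench2 and a public eval harness at https://github.com/EdisonScientific/labbench2.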

Related papers