FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
- URL: http://arxiv.org/abs/2602.09163v1
- Date: Mon, 09 Feb 2026 20:12:38 GMT
- Title: FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
- Authors: Xingjian Zhang, Sophia Moylan, Ziyang Xiong, Qiaozhu Mei, Yichen Luo, Jiaqi W. Ma
- Abstract summary: We present FlyBench to evaluate AI agents on end-to-end agentic curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase.
- Score: 10.00386797940562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.
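The abstract does not spell out how predicted annotations are scored against the expert-curated gold set. A minimal sketch of one plausible approach, set-based precision/recall/F1 over (gene, GO-term) pairs, is shown below; the function name and pair encoding are illustrative assumptions, not the benchmark's official metric:

```python
def annotation_f1(predicted, gold):
    """Set-based precision/recall/F1 over annotation pairs.

    `predicted` and `gold` are collections of hashable annotation
    tuples, e.g. ("Adh", "GO:0004022"). Illustrative scorer only,
    not FlyBench's official evaluation.
    """
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)  # annotations the agent got exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Exact-match scoring like this is strict: a prediction of a parent GO term (e.g. a more general molecular function) counts as wrong, so ontology-aware variants that credit ancestor terms are also conceivable.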
Related papers
- Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System [4.222675210976564]
Polymer literature contains a large and growing body of experimental knowledge, but much of it is buried in unstructured text and inconsistent terminology. Existing tools typically extract narrow, study-specific facts in isolation.
arXiv Detail & Related papers (2026-02-18T17:46:09Z) - FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights [63.32178443510396]
We introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings. Even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning.
arXiv Detail & Related papers (2026-02-02T23:21:13Z) - PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR [64.22412492998754]
We release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA. We train search agents in this environment to outperform non-RL retrieval baselines. Our data creation methods are scalable and easily extendable to other scientific domains.
arXiv Detail & Related papers (2026-01-26T06:46:16Z) - SciNetBench: A Relation-Aware Benchmark for Scientific Literature Retrieval Agents [12.057215000080705]
We propose SciNetBench, the first Scientific Network Relation-aware Benchmark for literature retrieval agents. Our benchmark systematically evaluates three levels of relations: ego-centric retrieval of papers with novel knowledge structures, pair-wise identification of scholarly relationships, and path-wise reconstruction of scientific evolutionary trajectories. We find that their accuracy on relation-aware retrieval tasks often falls below 20%, revealing a core shortcoming of current retrieval paradigms.
arXiv Detail & Related papers (2025-12-16T02:53:02Z) - OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive [50.468138755368805]
The opioid crisis marks a significant moment in public health, reflected in the data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). In this paper, we tackle the challenge of analyzing this archive by organizing the original dataset according to document attributes.
arXiv Detail & Related papers (2025-11-13T03:27:32Z) - ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers? [29.17900668495058]
We introduce ReplicationBench, an evaluation framework for frontier AI agents. It tests whether agents can replicate entire research papers drawn from the astrophysics literature. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks.
arXiv Detail & Related papers (2025-10-28T16:21:19Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - FROGENT: An End-to-End Full-process Drug Design Agent [19.025736969789566]
Powerful AI tools for drug discovery reside in isolated web apps, desktop programs, and code libraries. To address this issue, a Full-pROcess druG dEsign ageNT, named FROGENT, has been proposed. FROGENT utilizes a Large Language Model and the Model Context Protocol to integrate multiple dynamic biochemical databases, tool libraries, and task-specific AI models.
arXiv Detail & Related papers (2025-08-14T15:45:53Z) - HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG). The system addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z) - BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science [43.624608816218505]
BioKGBench is an evaluation benchmark for AI-driven biomedical agents.
We first disentangle "Understanding Literature" into two atomic abilities.
We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation.
We collect over two thousand data instances for the two atomic tasks and 225 high-quality annotated instances for the agent task.
arXiv Detail & Related papers (2024-06-29T15:23:28Z) - BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model. It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z) - PaperQA: Retrieval-Augmented Generative Agent for Scientific Research [41.9628176602676]
We present PaperQA, a RAG agent for answering questions over the scientific literature.
PaperQA is an agent that performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers.
We also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature.
arXiv Detail & Related papers (2023-12-08T18:50:20Z)
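Several of the systems above (PaperQA, HySemRAG, PaperSearchQA) share a retrieve-assess-generate loop: rank candidate passages by relevance to a query, then pass the top hits to a generator. A toy sketch of the retrieval step using bag-of-words cosine similarity is below; real systems use embeddings and LLM relevance judgments, and all names here are illustrative:

```python
import math
from collections import Counter

def score(query, doc):
    # Cosine similarity over bag-of-words counts: a crude stand-in for
    # the embedding- or LLM-based relevance scoring these agents use.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[w] * d[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    # Rank passages by relevance and return the top-k as generation context.
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]
```

In a full RAG agent the returned passages would be concatenated into the prompt of a language model that drafts and cites the answer; the benchmarks above differ mainly in what the query is (a gene symbol, a factoid question, a relation between papers) and how the output is scored.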
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.