BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
- URL: http://arxiv.org/abs/2601.21800v1
- Date: Thu, 29 Jan 2026 14:44:03 GMT
- Title: BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
- Authors: Dionizije Fa, Marko Čuljak, Bruno Pandža, Mateo Čupić,
- Abstract summary: BioAgent Bench is a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents. The benchmark contains curated end-to-end tasks with prompts that specify concrete output artifacts to support automated assessment. We evaluate frontier closed-source and open-weight models across multiple agent harnesses.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.
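The abstract's three robustness axes (corrupted inputs, decoy files, and prompt bloat) can be sketched as simple transformations over a task's input files. The function below is an illustrative mock-up under assumed conventions (a filename-to-text mapping, a `PROMPT.txt` file), not the benchmark's actual harness.

```python
import random
import string

def perturb_task(files: dict, mode: str, seed: int = 0) -> dict:
    """Apply one controlled perturbation to a task's input files.

    `files` maps filename -> text content. The three modes mirror the
    paper's robustness axes, but all names and details here are
    hypothetical, not BioAgent Bench's API.
    """
    rng = random.Random(seed)
    out = dict(files)
    if mode == "corrupt":
        # Flip ~5% of the characters in every input file.
        for name, text in out.items():
            chars = list(text)
            if not chars:
                continue
            for i in rng.sample(range(len(chars)), k=max(1, len(chars) // 20)):
                chars[i] = rng.choice(string.printable)
            out[name] = "".join(chars)
    elif mode == "decoy":
        # Add plausible-looking but irrelevant files the agent must ignore.
        for name in list(out):
            lines = out[name].splitlines() or [""]
            out[f"old_{name}.bak"] = "\n".join(rng.choices(lines, k=3))
    elif mode == "bloat":
        # Pad the task prompt with repeated irrelevant text.
        out["PROMPT.txt"] = out.get("PROMPT.txt", "") + "\nNote: see appendix." * 200
    return out
```

A grader can then compare the agent's artifacts on the clean and perturbed variants to separate high-level pipeline construction from step-level robustness.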
Related papers
- Mozi: Governed Autonomy for Drug Discovery LLM Agents [21.429647382651677]
In dependency-heavy pharmaceutical pipelines, autonomous agents often drift into irreproducible trajectories. We present Mozi, a dual-layer architecture that bridges the flexibility of generative AI with the deterministic rigor of computational biology. We demonstrate Mozi's ability to navigate massive chemical spaces, enforce stringent toxicity filters, and generate highly competitive in silico candidates.
arXiv Detail & Related papers (2026-03-04T02:22:21Z)
- Agentic AI for Self-Driving Laboratories in Soft Matter: Taxonomy, Benchmarks, and Open Challenges [8.153488410654004]
Self-driving laboratories (SDLs) close the loop between experiment design, automated execution, and data-driven decision making. This survey uses soft matter as a representative setting but focuses on the AI questions that arise in real laboratories.
arXiv Detail & Related papers (2026-01-25T17:44:19Z)
- Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification [71.98473277917962]
Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. We propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification.
arXiv Detail & Related papers (2026-01-22T09:47:31Z)
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments, inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
- BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents [1.1458853556386797]
We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language. We developed a benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. Biomechanically informed, domain-specific instructions significantly improved performance over generic prompts.
arXiv Detail & Related papers (2026-01-16T04:30:04Z) - Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress-testing AI agents. It learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models.
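The idea of steering a model along a learned trait direction in activation space can be illustrated with a minimal sketch. The `steer` function and the "trait direction" array here are generic activation-steering stand-ins, not the TraitBasis learning procedure itself.

```python
import numpy as np

def steer(hidden, direction, strength=1.0):
    """Shift each token's hidden state along a unit-norm trait direction.

    `hidden` is a (tokens, d_model) array of activations; `direction` is a
    learned (d_model,) vector, e.g. a hypothetical "impatience" axis.
    Broadcasting adds the same scaled direction to every token.
    """
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.asarray(hidden, dtype=float) + strength * d
```

Sweeping `strength` then lets a tester dial a user trait up or down while replaying the same task, which is the kind of controlled stress test the abstract describes.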
arXiv Detail & Related papers (2025-10-06T05:03:57Z) - Automatic Building Code Review: A Case Study [6.530899637501737]
Building officials face labor-intensive, error-prone, and costly manual reviews of design documents as projects increase in size and complexity. This study introduces a novel agent-driven framework that integrates BIM-based data extraction with automated verification.
arXiv Detail & Related papers (2025-10-03T00:30:14Z) - ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction [57.930531826380836]
This work explores whether a foundational segmentation model can address label scarcity in pixel-level vision tasks by acting as an annotator for unlabeled images. We propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data.
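The calibrate-then-filter step can be illustrated with a generic split-conformal sketch: compute a nonconformity cutoff on labeled calibration data, then drop pseudo-labels whose nonconformity exceeds it. The function names and the choice of 1 − max-probability as the score are illustrative assumptions, not ConformalSAM's exact procedure.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal cutoff: the (1 - alpha) finite-sample-corrected
    empirical quantile of nonconformity scores from labeled calibration data."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(cal_scores, min(q, 1.0), method="higher"))

def keep_mask(softmax_probs, threshold):
    """Keep pixels whose nonconformity (1 - max class probability) is
    within the cutoff; the rest are dropped as unreliable pseudo-labels."""
    nonconf = 1.0 - np.asarray(softmax_probs).max(axis=-1)
    return nonconf <= threshold
```

Only the pixels passing `keep_mask` would then be used as supervision for the semi-supervised learner.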
arXiv Detail & Related papers (2025-07-21T17:02:57Z) - Stress-Testing ML Pipelines with Adversarial Data Corruption [11.91482648083998]
Regulators now demand evidence that high-stakes systems can withstand realistic, interdependent errors. We introduce SAVAGE, a framework that formally models data-quality issues through dependency graphs and flexible corruption templates. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity.
arXiv Detail & Related papers (2025-06-02T00:41:24Z) - InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation [63.55258191625131]
InfoDeepSeek is a new benchmark for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. We develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics on the accuracy, utility, and compactness of information-seeking outcomes.
arXiv Detail & Related papers (2025-05-21T14:44:40Z) - LLM Agent Swarm for Hypothesis-Driven Drug Discovery [2.7036595757881323]
PharmaSwarm is a unified multi-agent framework that orchestrates specialized "agents" to propose, validate, and refine hypotheses for novel drug targets and lead compounds. By acting as an AI copilot, PharmaSwarm can accelerate translational research and deliver high-confidence hypotheses more efficiently than traditional pipelines.
arXiv Detail & Related papers (2025-04-24T22:27:50Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - MIBP-Cert: Certified Training against Data Perturbations with Mixed-Integer Bilinear Programs [50.41998220099097]
Data errors, corruptions, and poisoning attacks during training pose a major threat to the reliability of modern AI systems. We introduce MIBP-Cert, a novel certification method based on mixed-integer bilinear programming (MIBP). By computing the set of parameters reachable through perturbed or manipulated data, we can predict all possible outcomes and guarantee robustness.
arXiv Detail & Related papers (2024-12-13T14:56:39Z) - Benchmarking Uncertainty Qualification on Biosignal Classification Tasks
under Dataset Shift [16.15816241847314]
We propose a framework to evaluate the capability of the estimated uncertainty in capturing different types of biosignal dataset shifts.
In particular, we use three classification tasks based on respiratory sounds and electrocardiography signals to benchmark five representative uncertainty qualification methods.
arXiv Detail & Related papers (2021-12-16T20:42:17Z)
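As one representative score of the kind such uncertainty benchmarks compare, the predictive entropy of a classifier's softmax output is sketched below; under dataset shift it should rise on out-of-distribution inputs. This is a generic illustration, not one of the five methods the paper evaluates.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of a predicted class distribution, in nats.

    `probs` is a 1-D array of class probabilities summing to 1. Higher
    entropy means a less confident prediction, which a well-calibrated
    model should produce on shifted biosignal inputs.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())
```

A benchmark can then check whether this score is systematically higher on shifted test sets than on in-distribution data.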
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.