From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research
- URL: http://arxiv.org/abs/2512.04854v1
- Date: Thu, 04 Dec 2025 14:37:46 GMT
- Title: From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research
- Authors: Lukas Weidener, Marko Brkić, Chiara Bacci, Mihailo Jovanović, Emre Ulgac, Alex Dobrin, Johannes Weniger, Martin Vlas, Ritvik Singh, Aakaash Meduri,
- Abstract summary: This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.
- Score: 0.16174969956296248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence systems are increasingly deployed in biomedical research. However, current evaluation frameworks may inadequately assess their effectiveness as research collaborators. This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. Three major databases and two preprint servers were searched from January 1, 2018 to October 31, 2025, identifying 14 benchmarks that assess AI capabilities in literature understanding, experimental design, and hypothesis generation. The results revealed that all current benchmarks assess isolated component capabilities, including data analysis quality, hypothesis validity, and experimental protocol design. However, authentic research collaboration requires integrated workflows spanning multiple sessions, with contextual memory, adaptive dialogue, and constraint propagation. This gap implies that systems excelling on component benchmarks may fail as practical research co-pilots. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks: dialogue quality, workflow orchestration, session continuity, and researcher experience. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.
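To make the proposed framework concrete, here is a minimal sketch of how the four dimensions might be scored across recorded sessions. The dimension names come from the abstract; the `SessionLog` structure, the equal weights, and the averaging are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch of a process-oriented scoring rubric, assuming each
# dimension is rated on a 0-1 scale by human raters or an LLM judge.
from dataclasses import dataclass

# Dimension names taken from the abstract; everything else is assumed.
DIMENSIONS = ("dialogue_quality", "workflow_orchestration",
              "session_continuity", "researcher_experience")

@dataclass
class SessionLog:
    session_id: str
    ratings: dict  # dimension -> score in [0, 1]

def copilot_score(sessions: list[SessionLog],
                  weights: dict | None = None) -> float:
    """Aggregate per-session dimension ratings into one score."""
    weights = weights or {d: 0.25 for d in DIMENSIONS}  # assumed equal weights
    per_session = [
        sum(weights[d] * s.ratings.get(d, 0.0) for d in DIMENSIONS)
        for s in sessions
    ]
    return sum(per_session) / len(per_session)

if __name__ == "__main__":
    logs = [SessionLog("s1", {d: 0.8 for d in DIMENSIONS}),
            SessionLog("s2", {d: 0.6 for d in DIMENSIONS})]
    print(f"co-pilot score: {copilot_score(logs):.2f}")  # 0.70
```

The key difference from component benchmarks is that the unit of evaluation is the multi-session workflow, not a single task output.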
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z)
- Rethinking the AI Scientist: Interactive Multi-Agent Workflows for Scientific Discovery [0.17341675932416767]
This paper introduces Deep Research, a multi-agent system enabling interactive scientific investigation with turnaround times measured in minutes. The architecture comprises specialized agents for planning, data analysis, literature search, and novelty detection, unified through a persistent world state. Evaluation on the BixBench computational biology benchmark demonstrated state-of-the-art performance, achieving 48.8% accuracy on open response and 64.4% on multiple-choice evaluation.
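As an illustration of the persistent-world-state design described above, the sketch below shows stub agents reading from and writing to a shared state that survives across sessions. The state schema, the file-based persistence, and the agent stubs are assumptions, not the paper's implementation.

```python
# Illustrative sketch: specialized agents unified through a persistent
# world state, as in the architecture described in the abstract.
import json
from pathlib import Path

STATE_PATH = Path("world_state.json")  # hypothetical persistence location

def load_state() -> dict:
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())
    return {"plan": [], "findings": [], "literature": [], "novelty_notes": []}

def save_state(state: dict) -> None:
    STATE_PATH.write_text(json.dumps(state, indent=2))

# Stub agents: each reads from and writes back to the shared state.
def planner(state):           state["plan"].append("analyze dataset X")
def data_analyst(state):      state["findings"].append("effect size ~0.3")
def literature_search(state): state["literature"].append("PMID:000000 (stub)")
def novelty_detector(state):  state["novelty_notes"].append("overlaps prior work")

def run_turn(state: dict) -> dict:
    """One interactive turn: every agent updates the shared world state."""
    for agent in (planner, data_analyst, literature_search, novelty_detector):
        agent(state)
    return state

if __name__ == "__main__":
    state = run_turn(load_state())
    save_state(state)  # the state file carries context across sessions
```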
arXiv Detail & Related papers (2026-01-18T19:12:41Z)
- Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
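The generate/evaluate/refine cycle can be illustrated with a toy hill-climbing loop. The numeric stand-in objective below is purely illustrative; the actual ADRS work generates and benchmarks systems-performance designs.

```python
# Toy sketch of a generation/evaluation/refinement loop, with a numeric
# objective standing in for a real systems-performance benchmark.
import random

def generate(best: float | None) -> float:
    # Propose a new candidate near the current best (random restart if none).
    return random.uniform(0, 10) if best is None else best + random.uniform(-1, 1)

def evaluate(x: float) -> float:
    return -(x - 7.3) ** 2  # stand-in benchmark: higher is better

def refine(n_iters: int = 200) -> float:
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        cand = generate(best)
        score = evaluate(cand)
        if score > best_score:  # keep only improving designs
            best, best_score = cand, score
    return best

if __name__ == "__main__":
    print(f"best design parameter: {refine():.2f}")  # converges near 7.3
```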
arXiv Detail & Related papers (2025-12-16T18:51:23Z)
- SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent, which translates high-level research objectives into standardized experimental configurations, with an Experiment Manager that orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback. Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
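A rough sketch of the two roles described above, assuming a simple config sweep and a retry-based fault-tolerance policy. The `ExperimentConfig` fields and the trainer stub are invented for illustration, not SelfAI's actual interfaces.

```python
# Sketch: a User Agent turns an objective into standardized configs; an
# Experiment Manager runs them in parallel with simple retry on failure.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:  # hypothetical standardized config
    task: str
    model: str
    lr: float
    seed: int

def user_agent(objective: str) -> list[ExperimentConfig]:
    """Translate a high-level objective into a small config sweep."""
    return [ExperimentConfig(objective, "mlp-baseline", lr, seed)
            for lr in (1e-3, 1e-4) for seed in (0, 1)]

def train(cfg: ExperimentConfig, retries: int = 2) -> float:
    for attempt in range(retries + 1):
        try:
            return hash((cfg.lr, cfg.seed)) % 100 / 100  # stub metric
        except RuntimeError:       # a real trainer could fail transiently
            if attempt == retries:
                raise

def experiment_manager(configs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(configs, pool.map(train, configs)))

if __name__ == "__main__":
    results = experiment_manager(user_agent("regression on dataset D"))
    for cfg, metric in results.items():
        print(cfg.lr, cfg.seed, metric)
```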
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
- AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z)
- Machine Text Detectors are Membership Inference Attacks [55.07733196689313]
We theoretically and empirically investigate the transferability between the two tasks, i.e., how well a method originally developed for one task performs on the other. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors, demonstrate very strong rank correlation in cross-task performance. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities.
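Cross-task rank correlation of the kind reported here can be computed with Spearman's rho. The per-method scores below are made up for illustration; only the method of comparison reflects the abstract.

```python
# Minimal sketch: rank methods by their score on each task and correlate
# the rankings. Scores are invented; method order is the same in both lists.
from scipy.stats import spearmanr

mia_scores      = [0.71, 0.64, 0.80, 0.58, 0.75]  # membership inference
detector_scores = [0.69, 0.66, 0.83, 0.55, 0.77]  # machine-text detection

rho, p_value = spearmanr(mia_scores, detector_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A high rho means methods that rank well on one task rank well on the other.
```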
arXiv Detail & Related papers (2025-10-22T11:39:01Z)
- LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild [86.6586720134927]
LiveResearchBench is a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia. DeepEval is a comprehensive suite covering both content- and report-level quality. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
arXiv Detail & Related papers (2025-10-16T02:49:16Z)
- Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
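One plausible way to construct such user-task queries is to sample pairs from the task-profile grid; sampling 250 of the 50 x 25 = 1250 combinations is an assumption about the construction, not the benchmark's documented procedure.

```python
# Illustrative pairing of research tasks with user profiles to form
# personalized queries; names and sampling strategy are stand-ins.
import itertools
import random

tasks = [f"task-{i}" for i in range(50)]        # stand-ins for real tasks
profiles = [f"profile-{j}" for j in range(25)]  # stand-ins for user personas

random.seed(0)
all_pairs = list(itertools.product(tasks, profiles))  # 1250 combinations
queries = random.sample(all_pairs, 250)               # benchmark size

print(len(queries), queries[0])  # 250 ('task-...', 'profile-...')
```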
arXiv Detail & Related papers (2025-09-29T17:39:17Z)
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry [22.615102398311432]
We introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of deep AI research systems. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios. OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions.
arXiv Detail & Related papers (2025-07-22T06:51:26Z)
- An AI-Driven Live Systematic Reviews in the Brain-Heart Interconnectome: Minimizing Research Waste and Advancing Evidence Synthesis [29.81784450632149]
We develop an AI-driven system to enhance systematic reviews in the Brain-Heart Interconnectome (BHI) domain. The system integrates automated detection of Population, Intervention, Comparator, Outcome, and Study design (PICOS), semantic search using vector embeddings, graph-based querying, and topic modeling. The system provides real-time updates, reducing research waste through a living database and offering an interactive interface with dashboards and conversational AI.
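Of the components listed, the semantic-search step is the easiest to sketch: embed the abstracts and the query, then rank by cosine similarity. The toy bag-of-words "embedding" below stands in for a real sentence-embedding model.

```python
# Sketch of embedding-based semantic search over abstracts; the Counter
# embedding is a toy stand-in for a real encoder.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy bag-of-words vector

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

abstracts = {
    "p1": "stroke outcomes after cardiac intervention",
    "p2": "topic modeling of cardiology literature",
}
query = embed("cardiac intervention and stroke")
ranked = sorted(abstracts,
                key=lambda k: cosine(query, embed(abstracts[k])),
                reverse=True)
print(ranked)  # ['p1', 'p2']
```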
arXiv Detail & Related papers (2025-01-25T03:51:07Z)
- From Intention To Implementation: Automating Biomedical Research via LLMs [30.32209981487504]
This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments. BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming. The generated protocols, on average, outperform typical agent systems by 22.0% on five quality metrics.
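A minimal sketch of such a modular pipeline, with each specialized agent consuming the previous agent's output. The stage stubs and output format are assumptions, not BioResearcher's actual interfaces.

```python
# Sketch: a modular multi-agent pipeline where each stage's output feeds
# the next, mirroring the search -> literature -> design -> code flow.
def search_agent(goal: str) -> list[str]:
    return [f"paper about {goal} #{i}" for i in range(3)]  # stub retrieval

def literature_agent(papers: list[str]) -> str:
    return "; ".join(papers)  # stub synthesis of retrieved papers

def design_agent(synthesis: str) -> dict:
    return {"protocol": f"dry-lab analysis informed by: {synthesis}"}

def programming_agent(design: dict) -> str:
    return f"# generated script implementing {design['protocol']!r}"

def bioresearcher_pipeline(goal: str) -> str:
    return programming_agent(design_agent(literature_agent(search_agent(goal))))

if __name__ == "__main__":
    print(bioresearcher_pipeline("biomarker discovery"))
```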
arXiv Detail & Related papers (2024-12-12T16:35:05Z)
- STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond [68.47402386668846]
We introduce Structured Reasoning In Critical Text Assessment (STRICTA) to model text assessment as an explicit, step-wise reasoning process. STRICTA breaks down the assessment into a graph of interconnected reasoning steps drawing on causality theory. We apply STRICTA to a dataset of over 4000 reasoning steps from roughly 40 biomedical experts on more than 20 papers.
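A graph of interconnected reasoning steps can be represented as a DAG and traversed in dependency order. The step names below are invented; STRICTA's actual graph is derived from causality theory, while Python's standard-library `graphlib` handles the topological ordering here.

```python
# Sketch: reasoning steps as a DAG, executed so that every step runs only
# after the steps it depends on.
from graphlib import TopologicalSorter

# step -> set of prerequisite steps (hypothetical step names)
steps = {
    "identify_claim": set(),
    "check_methods": {"identify_claim"},
    "check_statistics": {"check_methods"},
    "assess_evidence": {"check_methods", "check_statistics"},
    "final_verdict": {"assess_evidence"},
}

for step in TopologicalSorter(steps).static_order():
    print("run:", step)  # prerequisites always print before dependents
```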
arXiv Detail & Related papers (2024-09-09T06:55:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.