Related papers: MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

URL: http://arxiv.org/abs/2512.11682v1
Date: Fri, 12 Dec 2025 16:01:48 GMT
Title: MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
Authors: Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl,
Abstract summary: Therapeutic decision-making in clinical medicine requires robust, multi-step reasoning grounded in reliable biomedical knowledge.<n>Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG)<n>This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems.
Score: 6.191248426050678
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

Related papers

EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning [72.70291772077738]
We propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis.<n>We then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time.<n>Our experiments show EvoClinician outperforms continual learning baselines and other self-evolving agents like memory agents.
arXiv Detail & Related papers (2026-01-30T13:26:18Z)
AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment [31.920800599579906]
Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals.<n>Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis.<n>We propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues.
arXiv Detail & Related papers (2025-08-15T16:20:45Z)
ConfAgents: A Conformal-Guided Multi-Agent Framework for Cost-Efficient Medical Diagnosis [11.18347744454527]
We introduce HealthFlow, a self-evolving AI agent that overcomes limitations through a novel meta-level evolution mechanism.<n>HealthFlow autonomously refines its own high-level problem-solving policies by distilling procedural successes and failures into a durable, strategic knowledge base.<n>Our experiments demonstrate that HealthFlow's self-evolving approach significantly outperforms state-of-the-art agent frameworks.
arXiv Detail & Related papers (2025-08-06T22:39:38Z)
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning.<n>This paper provides the first systematic review of this emerging field.<n>We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism [48.41735416075536]
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions.<n>We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations.
arXiv Detail & Related papers (2025-06-10T18:43:26Z)
DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery [54.79763887844838]
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution.<n>We introduce DrugPilot, a LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific in drug discovery.<n>DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively.
arXiv Detail & Related papers (2025-05-20T05:18:15Z)
LLM Agent Swarm for Hypothesis-Driven Drug Discovery [2.7036595757881323]
PharmaSwarm is a unified multi-agent framework that orchestrates specialized "agents" to propose, validate, and refine hypotheses for novel drug targets and lead compounds.<n>By acting as an AI copilot, PharmaSwarm can accelerate translational research and deliver high-confidence hypotheses more efficiently than traditional pipelines.
arXiv Detail & Related papers (2025-04-24T22:27:50Z)
TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools [22.322166889507184]
TxAgent is an AI agent that analyzes drug interactions, contraindications, and patient-specific treatment strategies.<n>The ToolUniverse consolidates 211 tools from trusted sources, including all US FDA-approved drugs since 1939.<n>It achieves 92.1% accuracy in open-ended drug reasoning tasks, surpassing GPT-4o and outperforming DeepSeek-R1 (671B) in structured multi-step reasoning.
arXiv Detail & Related papers (2025-03-14T00:28:15Z)
Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios [46.729092855387165]
We study the choice of the backbone LLM for medical AI agents, which is the foundation for the agent's overall reasoning and action generation.<n>Our findings demonstrate o1's ability to enhance diagnostic accuracy and consistency, paving the way for smarter, more responsive AI tools.
arXiv Detail & Related papers (2024-11-16T18:19:53Z)
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments [2.567146936147657]
We introduce AgentClinic, a multimodal agent benchmark for evaluating large language models (LLM) in simulated clinical environments.<n>We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy.
arXiv Detail & Related papers (2024-05-13T17:38:53Z)
Physical formula enhanced multi-task learning for pharmacokinetics prediction [54.13787789006417]
A major challenge for AI-driven drug discovery is the scarcity of high-quality data. We develop a formula enhanced mul-ti-task learning (PEMAL) method that predicts four key parameters of pharmacokinetics simultaneously. Our experiments reveal that PEMAL significantly lowers the data demand, compared to typical Graph Neural Networks.
arXiv Detail & Related papers (2024-04-16T07:42:55Z)
DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations [90.27736364704108]
We present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery. DrugOOD comes with an open-source Python package that fully automates benchmarking processes. We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction.
arXiv Detail & Related papers (2022-01-24T12:32:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.