Related papers: DEEPMED: Building a Medical DeepResearch Agent via Multi-hop Med-Search Data and Turn-Controlled Agentic Training & Inference

URL: http://arxiv.org/abs/2601.18496v2
Date: Wed, 04 Feb 2026 11:31:10 GMT
Title: DEEPMED: Building a Medical DeepResearch Agent via Multi-hop Med-Search Data and Turn-Controlled Agentic Training & Inference
Authors: Zihan Wang, Hao Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yiqun Zhang, Jinghao Lin, Haihua Yang, Xiaozhong Ji,
Abstract summary: DeepResearch (DR) models ground outputs in verifiable evidence from tools and perform strongly in general domains. We attribute this to two gaps: task characteristic and tool-use scaling. DeepMed improves its base model by 9.79% on average and outperforms larger medical reasoning and DR models.
Score: 34.74491972658472
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical reasoning models remain constrained by parametric knowledge and are thus susceptible to forgetting and hallucinations. DeepResearch (DR) models ground outputs in verifiable evidence from tools and perform strongly in general domains, but their direct transfer to the medical field yields relatively limited gains. We attribute this to two gaps: task characteristic and tool-use scaling. Medical questions require evidence interpretation in a knowledge-intensive clinical context; while general DR models can retrieve information, they often lack clinical-context reasoning and thus "find it but fail to use it," leaving performance limited by medical abilities. Moreover, in medical scenarios, blindly scaling tool calls can inject noisy context, derailing sensitive medical reasoning and prompting repetitive evidence-seeking along incorrect paths. Therefore, we propose DeepMed. For data, we deploy a multi-hop med-search QA synthesis method that supports the model in applying the DR paradigm in medical contexts. For training, we introduce a difficulty-aware turn-penalty to suppress excessive tool-call growth. For inference, we introduce a monitor to help validate hypotheses within a controlled number of steps and avoid context rot. Overall, on seven medical benchmarks, DeepMed improves its base model by 9.79% on average and outperforms larger medical reasoning and DR models.

Related papers

AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering [34.90463380394591]
We propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings.
arXiv Detail & Related papers (2025-09-26T01:22:25Z)
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning [52.12425911708585]
Deep-DxSearch is an agentic RAG system trained end-to-end with reinforcement learning (RL). In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources. Experiments demonstrate that our end-to-end RL training framework consistently outperforms prompt-engineering and training-free RAG approaches.
arXiv Detail & Related papers (2025-08-21T17:42:47Z)
Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree [14.013981070330153]
We propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of large language models (LLMs) and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making.
arXiv Detail & Related papers (2025-08-05T03:31:28Z)
ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples. It is built through a multi-agent generation, verification, and refinement process. Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z)
R2MED: A Benchmark for Reasoning-Driven Medical Retrieval [21.743193381874878]
We introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10.
arXiv Detail & Related papers (2025-05-20T16:15:30Z)
MedHal: An Evaluation Dataset for Medical Hallucination Detection [4.98142540436183]
We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts. MedHal addresses gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning.
arXiv Detail & Related papers (2025-04-11T14:55:15Z)
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs [39.65443626577068]
We introduce MedReason, a high-quality medical reasoning dataset. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets.
arXiv Detail & Related papers (2025-04-01T17:31:44Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks. We propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z)
Semi-Supervised Variational Reasoning for Medical Dialogue Generation [70.838542865384]
Two key characteristics are relevant for medical dialogue generation: patient states and physician actions. We propose an end-to-end variational reasoning approach to medical dialogue generation. A physician policy network composed of an action-classifier and two reasoning detectors is proposed for augmented reasoning ability.
arXiv Detail & Related papers (2021-05-13T04:14:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.