MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning
- URL: http://arxiv.org/abs/2602.03320v1
- Date: Tue, 03 Feb 2026 09:47:49 GMT
- Title: MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning
- Authors: Shengyuan Liu, Liuxin Bao, Qi Yang, Wanting Geng, Boyun Zheng, Chenxin Li, Wenting Chen, Houwen Peng, Yixuan Yuan
- Abstract summary: MedSAM-Agent is a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. We develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification. Experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance.
- Score: 53.37068897861388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available at https://github.com/CUHK-AIM-Group/MedSAM-Agent.
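As a rough illustration of the multi-turn interaction the abstract describes, the sketch below rolls out an episode in which an agent iteratively prompts a SAM-style segmenter and earns a verifiable outcome reward (Dice) minus a per-step penalty that encourages interaction parsimony. The `agent.act` and `segmenter.segment` interfaces and the penalty weight are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Dice overlap between a binary prediction and the ground truth."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def rollout_episode(agent, segmenter, image, gt_mask,
                    max_turns: int = 5, step_penalty: float = 0.05):
    """One multi-turn interactive segmentation episode.

    Each turn, the agent inspects the current mask and either emits a
    refinement prompt (e.g., a point or box) for the segmenter or stops.
    The return couples an end-of-episode outcome reward with a process-
    level penalty on the number of interactions -- a stand-in for the
    paper's outcome verification plus parsimony-oriented process reward.
    """
    mask = np.zeros_like(gt_mask, dtype=bool)
    n_steps = 0
    for _ in range(max_turns):
        action = agent.act(image, mask)          # hypothetical policy API
        if action["type"] == "stop":             # agent decides it is done
            break
        mask = segmenter.segment(image, action)  # SAM-style prompted tool
        n_steps += 1
    reward = dice(mask, gt_mask) - step_penalty * n_steps
    return mask, reward
```

A trained policy would be optimized against this scalar return with an RLVR-style objective; the explicit stop action is what lets the agent trade accuracy against redundant refinement steps.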
Related papers
- Multi-Agent Model-Based Reinforcement Learning with Joint State-Action Learned Embeddings [10.36125908359289]
We present a novel model-based multi-agent reinforcement learning framework. We design a world model trained with variational auto-encoders and augment the model using the state-action learned embedding (SALE). By coupling imagined trajectories with SALE-based action values, the agents acquire a richer understanding of how their choices influence collective outcomes.
arXiv Detail & Related papers (2026-02-13T01:57:21Z)
- MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation [64.2621682259008]
We propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS2) that integrates policy learning with multi-agent tree search. We show that MARTI-MARS2 achieves 77.7%, outperforming strong baselines such as GPT-5.1 on challenging code generation benchmarks.
arXiv Detail & Related papers (2026-02-08T07:28:44Z)
- ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging [51.409102048965394]
Agent-Role Merging (ARM) is an activation-guided, role-conditioned neuron transplantation method for model merging in LLM agents. ARM extends existing merging methods from static natural-language tasks to multi-turn agent scenarios.
arXiv Detail & Related papers (2026-01-12T08:31:53Z)
- Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection [59.04089915447622]
ForenAgent is an interactive image forgery detection (IFD) framework that enables MLLMs to autonomously generate, execute, and refine Python-based low-level tools around the detection objective. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks.
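The generate-execute-refine loop in this abstract can be sketched generically as below, assuming a hypothetical `llm(prompt)` callable that returns a Python snippet; the sandboxing, prompts, and forensic tools of the actual system are not described in the abstract.

```python
import contextlib
import io

def run_tool(code: str) -> str:
    """Execute a generated snippet and capture stdout (or the error).

    Illustration only: a real agent would sandbox execution rather than
    call exec() on model-generated code.
    """
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()
    except Exception as exc:
        return f"ERROR: {exc}"

def tool_loop(llm, task: str, max_iters: int = 4) -> str:
    """Generate -> execute -> refine, feeding tool output back to the model."""
    transcript = task
    for _ in range(max_iters):
        code = llm(transcript)           # hypothetical MLLM call
        result = run_tool(code)
        transcript += f"\n--- tool output ---\n{result}"
        if not result.startswith("ERROR"):
            break                        # crude success check; refine on error
    return transcript
```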
arXiv Detail & Related papers (2025-12-18T08:38:44Z)
- EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning [6.96058549084651]
EndoAgent is a memory-guided agent for vision-to-decision endoscopic analysis. It integrates iterative reasoning with adaptive tool selection and collaboration. It consistently outperforms both general and medical multimodal models.
arXiv Detail & Related papers (2025-08-10T11:02:57Z)
- DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making [4.801722645791233]
DynamiCare is a novel dynamic multi-agent framework that models clinical diagnosis as a multi-round, interactive loop. We demonstrate the feasibility and effectiveness of DynamiCare through extensive experiments.
arXiv Detail & Related papers (2025-07-03T13:43:10Z)
- MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks [27.717720332927296]
We introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: medical (visual) question answering, lay summary generation, structured Electronic Health Record (EHR) predictive modeling, and clinical workflow automation. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, it does not consistently outperform advanced single LLMs.
arXiv Detail & Related papers (2025-05-18T11:28:17Z)
- MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories [4.503215272392276]
We propose MORAL, a multimodal reinforcement learning framework for decision making in autonomous laboratories. We generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates.
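A minimal reading of the early-fusion step: concatenate the caption embedding with the visual features before any decision layer, then project to a joint state for the policy. The dimensions and the single projection below are toy assumptions, not the paper's architecture.

```python
import numpy as np

def early_fusion(visual_feat: np.ndarray, caption_emb: np.ndarray,
                 w: np.ndarray) -> np.ndarray:
    """Concatenate modalities first, then learn one joint representation."""
    fused = np.concatenate([visual_feat, caption_emb], axis=-1)
    return np.tanh(fused @ w)  # joint state fed to the RL policy

# Toy usage with hypothetical feature sizes (256 visual + 128 text -> 64).
rng = np.random.default_rng(0)
visual, caption = rng.normal(size=256), rng.normal(size=128)
w = 0.05 * rng.normal(size=(384, 64))
print(early_fusion(visual, caption, w).shape)  # (64,)
```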
arXiv Detail & Related papers (2025-04-04T04:15:52Z)
- From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
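To see why a step-wise reward helps, compare the discounted return-to-go under a sparse outcome-only reward with a dense per-step signal. This is generic reward bookkeeping, not StepAgent's implicit-reward or inverse-RL machinery, and the reward values are made up.

```python
def returns_to_go(step_rewards, gamma=0.99):
    """Discounted return at every step of a trajectory."""
    out, g = [], 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# Outcome-only: early actions see the terminal reward only through discounting.
print(returns_to_go([0.0, 0.0, 1.0]))  # ~[0.9801, 0.99, 1.0]
# Step-wise: every action carries its own credit signal.
print(returns_to_go([0.2, 0.3, 0.5]))
```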
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
- Multi-Agent Imitation Learning with Copulas [102.27052968901894]
Multi-agent imitation learning aims to train multiple agents to perform tasks from demonstrations by learning a mapping between observations and actions.
In this paper, we propose to use copulas, a powerful statistical tool for capturing dependence among random variables, to explicitly model the correlation and coordination in multi-agent systems.
Our proposed model is able to separately learn marginals that capture the local behavioral patterns of each individual agent, as well as a copula function that solely and fully captures the dependence structure among agents.
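The marginals-plus-copula factorization described above can be illustrated with a generic Gaussian-copula sampler (SciPy assumed): each agent keeps its own marginal, while a shared correlation matrix carries the coordination. The two marginals and the 0.8 correlation are toy choices, not the paper's learned model.

```python
import numpy as np
from scipy.stats import expon, norm

def sample_joint_actions(corr: np.ndarray, inverse_cdfs, n: int = 5):
    """Draw coordinated actions: Gaussian copula over per-agent marginals."""
    L = np.linalg.cholesky(corr)                 # encodes the dependence
    z = np.random.default_rng(0).normal(size=(n, len(inverse_cdfs))) @ L.T
    u = norm.cdf(z)                              # correlated uniforms in (0, 1)
    return np.stack([icdf(u[:, i])               # map through each marginal
                     for i, icdf in enumerate(inverse_cdfs)], axis=1)

# Two agents: standard-normal and exponential marginals, correlation 0.8.
corr = np.array([[1.0, 0.8], [0.8, 1.0]])
print(sample_joint_actions(corr, [norm.ppf, expon.ppf]))
```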
arXiv Detail & Related papers (2021-07-10T03:49:41Z)