WSI-Agents: A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis
- URL: http://arxiv.org/abs/2507.14680v1
- Date: Sat, 19 Jul 2025 16:11:03 GMT
- Title: WSI-Agents: A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis
- Authors: Xinheng Lyu, Yuci Liang, Wenting Chen, Meidan Ding, Jiaqi Yang, Guolin Huang, Daokun Zhang, Xiangjian He, Linlin Shen
- Abstract summary: Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. We propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis.
- Score: 28.548748698432416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi-modal large language models (MLLMs) allow multi-task WSI analysis through natural language, they often underperform compared to task-specific models. Collaborative multi-agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology-specific domains. To address these issues, we propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis. WSI-Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task-specific accuracy and multi-task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch- and WSI-level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain-specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi-modal WSI benchmarks show the superiority of WSI-Agents over current WSI MLLMs and medical agent frameworks across diverse tasks.
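The abstract describes a three-stage control flow: a task allocation module routes a query to expert agents drawn from a model zoo, a verification mechanism combines internal consistency checks with external validation, and a summary module produces the final report. The sketch below is a minimal, hypothetical Python illustration of that flow only; all class and function names (ModelZoo, verify, summarize, the lambda "agents") are assumptions for illustration and do not come from the paper, which does not publish an implementation here.

```python
# Hypothetical sketch of the three-component pipeline described in the abstract:
# (1) task allocation over a model zoo, (2) verification via internal consensus
# plus an external knowledge check, (3) summary synthesis. Illustrative only.
from dataclasses import dataclass
from collections import Counter
from typing import Callable, Dict, List

Agent = Callable[[str], str]  # an expert agent maps a WSI query to an answer


@dataclass
class ModelZoo:
    """Patch-level and WSI-level expert agents keyed by task type (assumed structure)."""
    experts: Dict[str, List[Agent]]

    def allocate(self, task_type: str) -> List[Agent]:
        # Task allocation module: route the query to the matching experts.
        return self.experts.get(task_type, [])


def verify(answers: List[str], knowledge_check: Callable[[str], bool]) -> str:
    """Verification mechanism: external validation filter + internal consistency vote."""
    externally_valid = [a for a in answers if knowledge_check(a)]
    ranked = Counter(externally_valid or answers).most_common(1)
    return ranked[0][0] if ranked else "inconclusive"


def summarize(task_type: str, answer: str) -> str:
    """Summary module: fold the verified answer into a final report.
    (Visual interpretation maps are omitted in this sketch.)"""
    return f"[{task_type}] verified answer: {answer}"


if __name__ == "__main__":
    zoo = ModelZoo(experts={
        "subtyping": [lambda q: "lung adenocarcinoma", lambda q: "lung adenocarcinoma"],
        "grading": [lambda q: "grade II"],
    })
    query, task = "What is the tumor subtype?", "subtyping"
    candidates = [agent(query) for agent in zoo.allocate(task)]
    final = verify(candidates, knowledge_check=lambda a: "carcinoma" in a)
    print(summarize(task, final))
```

Running the example prints "[subtyping] verified answer: lung adenocarcinoma"; the stub knowledge check stands in for the pathology knowledge bases and domain-specific models mentioned in the abstract.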
Related papers
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It achieves superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning [15.670921552151775]
RingMo-Agent is designed to handle multi-modal and multi-platform data. It is supported by a large-scale vision-language dataset named RS-VL3M. It proves effective in both visual understanding and sophisticated analytical tasks.
arXiv Detail & Related papers (2025-07-28T12:39:33Z)
- A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model [26.704101714550827]
We present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision.
arXiv Detail & Related papers (2025-07-23T08:09:42Z)
- AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving [30.50203052125566]
AgentOrchestra is a hierarchical multi-agent framework for general-purpose task solving. It features a central planning agent that decomposes complex objectives and delegates sub-tasks to a team of specialized agents. Each sub-agent is equipped with general programming and analytical tools, as well as the ability to tackle a wide range of real-world tasks.
arXiv Detail & Related papers (2025-06-14T13:45:37Z)
- Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z)
- Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z)
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models [70.41727912081463]
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images. We propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Our model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning.
arXiv Detail & Related papers (2025-05-22T17:59:39Z)
- M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals. Recent studies either develop tool-using approaches or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. We propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks.
arXiv Detail & Related papers (2024-12-26T18:56:05Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems. This paper systematically reviews the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- Towards Multi-Objective High-Dimensional Feature Selection via Evolutionary Multitasking [63.91518180604101]
This paper develops a novel evolutionary multitasking (EMT) framework for high-dimensional feature selection problems, namely MO-FSEMT. A task-specific knowledge transfer mechanism is designed to leverage the advantageous information of each task, enabling the discovery and effective transmission of high-quality solutions.
arXiv Detail & Related papers (2024-01-03T06:34:39Z)
- MulGT: Multi-task Graph-Transformer with Task-aware Knowledge Injection and Domain Knowledge-driven Pooling for Whole Slide Image Analysis [17.098951643252345]
Whole slide images (WSIs) have been widely used to assist automated diagnosis in the deep learning field. We present a novel multi-task framework (i.e., MulGT) for WSI analysis built on a specially designed Graph-Transformer.
arXiv Detail & Related papers (2023-02-21T10:00:58Z)