Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks
- URL: http://arxiv.org/abs/2511.06665v1
- Date: Mon, 10 Nov 2025 03:22:42 GMT
- Title: Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks
- Authors: Lingran Song, Yucheng Zhou, Jianbing Shen
- Abstract summary: We introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS). MDS aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. We propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation.
- Score: 54.00822479127598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.
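The abstract names the RVLS2M module but does not describe its internals. As a rough illustration of the general idea the name suggests (turning vision-language similarity into a mask), the sketch below scores each image patch embedding against a text embedding with cosine similarity and thresholds the result. This is not the authors' implementation: the function names, the threshold value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_mask(patch_embeddings, text_embedding, threshold=0.5):
    """Return a binary grid: 1 where a patch embedding aligns with the text embedding.

    patch_embeddings is a 2-D grid (list of rows) of embedding vectors.
    The threshold is an assumed, illustrative value.
    """
    return [
        [1 if cosine(patch, text_embedding) >= threshold else 0 for patch in row]
        for row in patch_embeddings
    ]


# Toy 2x2 grid of 2-D patch embeddings (made-up values for illustration).
patches = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [-1.0, 0.0]],
]
query = [1.0, 0.0]  # stand-in for an encoded clinical query

mask = similarity_mask(patches, query)
print(mask)
```

In a real system the embeddings would come from trained image and text encoders, and the coarse patch-level mask would typically be refined or upsampled to pixel resolution; the sketch only shows the similarity-to-mask step.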
Related papers
- RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology [5.502516603909592]
We introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified task. We then propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation.
arXiv Detail & Related papers (2025-10-21T00:28:13Z)
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
- MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models [48.24824129683951]
We introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions. To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions. It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks.
arXiv Detail & Related papers (2025-06-12T08:13:38Z)
- Dynamically evolving segment anything model with continuous learning for medical image segmentation [50.92344083895528]
We introduce EvoSAM, a dynamically evolving medical image segmentation model. EvoSAM continuously accumulates new knowledge from an ever-expanding array of scenarios and tasks. Experiments conducted by surgical clinicians on blood vessel segmentation confirm that EvoSAM enhances segmentation efficiency based on user prompts.
arXiv Detail & Related papers (2025-03-08T14:37:52Z)
- Enhanced MRI Representation via Cross-series Masking [48.09478307927716]
A Cross-Series Masking (CSM) strategy for effectively learning MRI representations in a self-supervised manner. The method achieves state-of-the-art performance on both public and in-house datasets.
arXiv Detail & Related papers (2024-12-10T10:32:09Z)
- MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation [2.2585213273821716]
We introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss. We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further.
arXiv Detail & Related papers (2024-09-28T23:10:37Z)
- A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics [63.106382317917344]
We report a Transformer-based representation-learning model as a clinical diagnostic aid that processes multimodal input in a unified manner.
The unified model outperformed an image-only model and non-unified multimodal diagnosis models in the identification of pulmonary diseases.
arXiv Detail & Related papers (2023-06-01T16:23:47Z)
- SeATrans: Learning Segmentation-Assisted diagnosis model via Transforme [13.63128987400635]
We propose Segmentation-Assisted diagnosis Transformer (SeATrans) to transfer the segmentation knowledge to the disease diagnosis network.
We first propose an asymmetric multi-scale interaction strategy to correlate each single low-level diagnosis feature with multi-scale segmentation features.
To model the segmentation-diagnosis interaction, SeA-block first embeds the diagnosis feature based on the segmentation information via the encoder, and then transfers the embedding back to the diagnosis feature space by a decoder.
arXiv Detail & Related papers (2022-06-12T15:10:33Z)
- Opinions Vary? Diagnosis First! [5.39322899965008]
In medical image segmentation, images are usually annotated by several different clinical experts.
Computer vision models often assume there is a unique ground truth for each instance.
We propose a framework that takes the diagnosis result as the gold standard to estimate the segmentation mask from the multi-rater segmentation labels.
arXiv Detail & Related papers (2022-02-14T06:33:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.