MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model
- URL: http://arxiv.org/abs/2410.05905v1
- Date: Tue, 8 Oct 2024 11:04:01 GMT
- Title: MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model
- Authors: Yiwen Ye, Ziyang Chen, Jianpeng Zhang, Yutong Xie, Yong Xia,
- Abstract summary: We introduce MedUniSeg, a prompt-driven universal segmentation model for 2D and 3D multi-task segmentation.
MedUniSeg employs multiple modal-specific prompts alongside a universal task prompt to accurately characterize the modalities and tasks.
We evaluate MedUniSeg on a comprehensive multi-modal upstream dataset consisting of 17 sub-datasets.
- Score: 27.58715707047272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Universal segmentation models offer significant potential in addressing a wide range of tasks by effectively leveraging discrete annotations. As the scope of tasks and modalities expands, it becomes increasingly important to generate and strategically position task- and modal-specific priors within the universal model. However, existing universal models often overlook the correlations between different priors, and the optimal placement and frequency of these priors remain underexplored. In this paper, we introduce MedUniSeg, a prompt-driven universal segmentation model designed for 2D and 3D multi-task segmentation across diverse modalities and domains. MedUniSeg employs multiple modal-specific prompts alongside a universal task prompt to accurately characterize the modalities and tasks. To generate the related priors, we propose the modal map (MMap) and the fusion and selection (FUSE) modules, which transform modal and task prompts into corresponding priors. These modal and task priors are systematically introduced at the start and end of the encoding process. We evaluate MedUniSeg on a comprehensive multi-modal upstream dataset consisting of 17 sub-datasets. The results demonstrate that MedUniSeg achieves superior multi-task segmentation performance, attaining a 1.2% improvement in the mean Dice score across the 17 upstream tasks compared to nnUNet baselines, while using less than 1/10 of the parameters. For tasks that underperform during the initial multi-task joint training, we freeze MedUniSeg and introduce new modules to re-learn these tasks. This approach yields an enhanced version, MedUniSeg*, which consistently outperforms MedUniSeg across all tasks. Moreover, MedUniSeg surpasses advanced self-supervised and supervised pre-trained models on six downstream tasks, establishing itself as a high-quality, highly generalizable pre-trained segmentation model.
Related papers
- From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons [85.99268361356832]
We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA)
GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer.
Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
arXiv Detail & Related papers (2024-12-11T15:06:25Z) - DriveMM: All-in-One Large Multimodal Model for Autonomous Driving [63.882827922267666]
DriveMM is a large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of autonomous driving tasks.
We conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks.
arXiv Detail & Related papers (2024-12-10T17:27:32Z) - One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning [16.96824902454355]
We propose a unified framework that concurrently handles multiple tasks and modalities.
In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach.
We present a new benchmark, MMUD, which includes samples annotated with multiple task labels.
We demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner.
arXiv Detail & Related papers (2024-08-06T07:19:51Z) - Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models [41.64717254672843]
Visual grounding occupies a pivotal position in multi-modality vision-language models.
We propose ViLaM, a large multi-modality model, that supports multi-tasks of VG.
ViLaM extends a wide range of instructions, thereby significantly enhancing its generalization and interaction potentials.
arXiv Detail & Related papers (2023-11-21T03:40:09Z) - AdaMerging: Adaptive Model Merging for Multi-Task Learning [68.75885518081357]
This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging)
It aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data.
Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance.
arXiv Detail & Related papers (2023-10-04T04:26:33Z) - UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong
Representation Learner [32.698493660851035]
We propose a prompt-driven universal model (UniSeg) for multi-task medical image segmentation.
We make the model 'aware' of the ongoing task early and boost the task-specific training of the whole decoder.
Our results indicate that the proposed UniSeg outperforms other universal models and single-task models on 11 upstream tasks.
arXiv Detail & Related papers (2023-04-07T06:28:51Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z) - Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and
Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-gnostic tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z) - Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.