MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging
- URL: http://arxiv.org/abs/2511.12373v1
- Date: Sat, 15 Nov 2025 22:27:49 GMT
- Title: MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging
- Authors: Fan Li, Arun Iyengar, Lanyu Xu
- Abstract summary: We propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models.
Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders.
- Score: 5.169719124205838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians. However, single-task models are predominantly used, overlooking the information shared across tasks; this leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model that addresses the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method outperforms prior work. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. The multi-task model significantly reduces computational cost and achieves faster inference while maintaining performance comparable to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification in 3D medical imaging, demonstrating its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.
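As a rough illustration of the shared-encoder, multi-decoder pattern the abstract describes, the following PyTorch sketch wires one backbone to three task heads. The stand-in convolutional encoder, module sizes, and head designs are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiTask3D(nn.Module):
    """Shared encoder with task-specific decoders (illustrative sketch only)."""
    def __init__(self, in_ch=4, num_classes=3, feat=(32, 64, 128, 256)):
        super().__init__()
        # Stand-in for the paper's Transformer encoder: any backbone that
        # returns multi-scale 3D feature maps fits this slot.
        self.encoder = nn.ModuleList(
            nn.Conv3d(in_ch if i == 0 else feat[i - 1], f, 3, stride=2, padding=1)
            for i, f in enumerate(feat)
        )
        # Lightweight CNN task heads, one per task.
        self.seg_head = nn.Conv3d(feat[0], num_classes, 1)   # coarse mask logits
        self.det_head = nn.Conv3d(feat[-1], 7, 1)            # 6 box params + objectness
        self.cls_head = nn.Linear(feat[-1], num_classes)

    def forward(self, x):
        feats = []
        for stage in self.encoder:
            x = torch.relu(stage(x))
            feats.append(x)
        seg = self.seg_head(feats[0])                        # high-resolution stream
        det = self.det_head(feats[-1])                       # dense low-res predictions
        cls = self.cls_head(feats[-1].mean(dim=(2, 3, 4)))   # global average pooling
        return seg, det, cls

seg, det, cls = MultiTask3D()(torch.randn(1, 4, 64, 64, 64))  # 4-modality MRI crop
```

The efficiency claim in the abstract follows directly from this shape: the encoder's cost is paid once and amortized across all three heads, rather than once per single-task model.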
Related papers
- Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification [11.13919196108179]
We introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs.
Our method scales efficiently to new tasks by adding only lightweight plugins on top of a single frozen backbone.
Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification.
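A minimal sketch of the frozen-backbone-plus-plugin recipe this summary alludes to; the bottleneck adapter below is a generic stand-in whose names and sizes are hypothetical, not AnyMC3D's actual plugins.

```python
import torch
import torch.nn as nn

class PluginHead(nn.Module):
    """Lightweight trainable plugin on top of a frozen backbone (illustrative)."""
    def __init__(self, dim=768, hidden=64, num_classes=2):
        super().__init__()
        self.down = nn.Linear(dim, hidden)   # bottleneck adapter
        self.up = nn.Linear(hidden, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, feats):                # feats: (batch, dim) from the 2D FM
        z = feats + self.up(torch.relu(self.down(feats)))  # residual adapter
        return self.head(z)

backbone = nn.Linear(768, 768)               # stand-in for a frozen 2D foundation model
for p in backbone.parameters():
    p.requires_grad = False                  # only PluginHead parameters are trained
```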
arXiv Detail & Related papers (2025-12-15T00:01:19Z)
- Does DINOv3 Set a New Medical Vision Standard? [67.33543059306938]
This report investigates whether DINOv3 can serve as a powerful unified encoder for medical vision tasks without domain-specific pre-training.
We benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation.
Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks.
arXiv Detail & Related papers (2025-09-08T09:28:57Z)
- MECFormer: Multi-task Whole Slide Image Classification with Expert Consultation Network [2.6954348706500766]
Whole slide image (WSI) classification is a crucial problem for cancer diagnostics in clinics and hospitals.
Previous MIL-based models designed for this problem have only been evaluated on individual tasks for specific organs.
We propose MECFormer, a generative Transformer-based model designed to handle multiple tasks within one model.
arXiv Detail & Related papers (2024-10-06T14:56:23Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter a task discrepancy, because pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
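The core of multi-task supervised pretraining is a joint objective over all task losses. A minimal version, with uniform task weights as an assumption (MTP's actual weighting may differ), looks like this:

```python
def multitask_loss(task_losses, weights=None):
    """Weighted sum of per-task losses, e.g. semantic segmentation,
    instance segmentation, and rotated object detection."""
    weights = weights or {name: 1.0 for name in task_losses}
    return sum(weights[name] * loss for name, loss in task_losses.items())
```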
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- Masked LoGoNet: Fast and Accurate 3D Image Analysis for Medical Domain [46.44049019428938]
We introduce a new neural network architecture, termed LoGoNet, with a tailored self-supervised learning (SSL) method.
LoGoNet integrates a novel feature extractor within a U-shaped architecture, leveraging Large Kernel Attention (LKA) and a dual encoding strategy.
We propose a novel SSL method tailored for 3D images to compensate for the lack of large labeled datasets.
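Large Kernel Attention, as introduced for 2D vision, decomposes a large receptive field into a depthwise, a dilated depthwise, and a pointwise convolution. The 3D transcription below is an assumption for the volumetric setting, not LoGoNet's exact block.

```python
import torch.nn as nn

class LKA3D(nn.Module):
    """Large Kernel Attention transcribed to 3D (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv3d(dim, dim, 5, padding=2, groups=dim)   # local context
        self.dw_dilated = nn.Conv3d(dim, dim, 7, padding=9,
                                    groups=dim, dilation=3)       # long-range context
        self.pw = nn.Conv3d(dim, dim, 1)                          # channel mixing

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return attn * x    # attention as elementwise gating of the input
```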
arXiv Detail & Related papers (2024-02-09T05:06:58Z)
- ProMISe: Prompt-driven 3D Medical Image Segmentation Using Pretrained Image Foundation Models [13.08275555017179]
We propose ProMISe, a prompt-driven 3D medical image segmentation model using only a single point prompt.
We evaluate our model on two public datasets for colon and pancreas tumor segmentations.
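To make "a single point prompt" concrete, here is a toy way to condition a 3D mask head on one click. ProMISe builds on a pretrained image foundation model and its real prompt encoder differs, so every name and shape below is illustrative.

```python
import torch
import torch.nn as nn

class PointPromptHead(nn.Module):
    """Condition mask logits on one (z, y, x) point prompt (toy sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.point_embed = nn.Linear(3, dim)      # normalized coordinates in [0, 1]
        self.decode = nn.Conv3d(dim, 1, 1)        # single-channel mask logits

    def forward(self, feats, point):              # feats: (B, dim, D, H, W); point: (B, 3)
        p = self.point_embed(point)               # (B, dim) prompt embedding
        cond = feats + p[:, :, None, None, None]  # broadcast prompt into the volume
        return self.decode(cond)
```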
arXiv Detail & Related papers (2023-10-30T16:49:03Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
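A toy rendering of that pattern with learned task queries attending to the shared tokens; the dimensions, query count, and task names are placeholders, not MulT's configuration.

```python
import torch
import torch.nn as nn

class TaskDecoders(nn.Module):
    """One transformer decoder head per task over a shared representation."""
    def __init__(self, dim=256, tasks=("segmentation", "depth"), n_queries=16):
        super().__init__()
        make = lambda: nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.decoders = nn.ModuleDict({t: make() for t in tasks})
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(n_queries, dim)) for t in tasks})

    def forward(self, shared_tokens):              # (B, N, dim) from a shared encoder
        b = shared_tokens.size(0)
        return {t: dec(self.queries[t].expand(b, -1, -1), shared_tokens)
                for t, dec in self.decoders.items()}
```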
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation [14.873473285148853]
We introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and Convolutional Neural Network (CNN)- and transformer-based decoders.
In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision.
We present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked tokens.
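The masked-token pretext task can be stated in a few lines; the masking ratio and reconstruction loss here are generic assumptions, not UNetFormer's published recipe.

```python
import torch

def mask_tokens(tokens, mask_ratio=0.3):
    """Randomly zero out a fraction of tokens for masked-prediction pretraining."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < mask_ratio
    return tokens.masked_fill(mask.unsqueeze(-1), 0.0), mask

# The encoder is trained to predict the original tokens at masked positions:
# loss = ((pred - tokens)[mask] ** 2).mean()
```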
arXiv Detail & Related papers (2022-04-01T17:38:39Z)