MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
- URL: http://arxiv.org/abs/2505.19225v1
- Date: Sun, 25 May 2025 16:39:35 GMT
- Title: MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
- Authors: Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning, Wei Li, Lihao Liu, Qiushan Guo, Tianbin Li, Junjun He, Hongming Shan,
- Abstract summary: We present MedITok, the first unified tokenizer tailored for medical images.<n>It encodes both low-level structural details and high-level clinical semantics within a unified latent space.<n>It achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks.
- Score: 23.783507307500116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.
Related papers
- MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models [48.24824129683951]
We introduce medical image reasoning segmentation, a novel task that aims to generate segmentation masks based on complex and implicit medical instructions.<n>To address this, we propose MedSeg-R, an end-to-end framework that leverages the reasoning abilities of MLLMs to interpret clinical questions.<n>It is built on two core components: 1) a global context understanding module that interprets images and comprehends complex medical instructions to generate multi-modal intermediate tokens, and 2) a pixel-level grounding module that decodes these tokens to produce precise segmentation masks.
arXiv Detail & Related papers (2025-06-12T08:13:38Z) - Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant [11.187690318227514]
RCMed is a full-stack AI assistant that improves multimodal alignment in both input and output.<n>It achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries.
arXiv Detail & Related papers (2025-05-06T10:00:08Z) - MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions [2.2427832125073732]
MedIL is a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions.<n>We show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets.
arXiv Detail & Related papers (2025-04-12T19:52:56Z) - An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training [40.16314726875265]
ConceptCLIP is the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy.<n>We develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations.
arXiv Detail & Related papers (2025-01-26T16:07:11Z) - Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.<n>Our approach various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z) - Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR)
arXiv Detail & Related papers (2024-05-30T03:15:09Z) - Building Universal Foundation Models for Medical Image Analysis with
Spatially Adaptive Networks [5.661631789478932]
We propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure.
We pre-train a spatial adaptive visual tokenizer (SPAD-VT) and then a spatial adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets.
The experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model.
arXiv Detail & Related papers (2023-12-12T08:33:45Z) - Disruptive Autoencoders: Leveraging Low-level features for 3D Medical
Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z) - LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical
Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z) - Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation [38.61227663176952]
We propose a shift towards universal medical image segmentation, a paradigm aiming to build medical image understanding foundation models.
We develop Hermes, a novel context-prior learning approach to address the challenges of data heterogeneity and annotation differences in medical image segmentation.
arXiv Detail & Related papers (2023-06-04T17:39:08Z) - MedSegDiff-V2: Diffusion based Medical Image Segmentation with
Transformer [53.575573940055335]
We propose a novel Transformer-based Diffusion framework, called MedSegDiff-V2.
We verify its effectiveness on 20 medical image segmentation tasks with different image modalities.
arXiv Detail & Related papers (2023-01-19T03:42:36Z) - CUTS: A Deep Learning and Topological Framework for Multigranular Unsupervised Medical Image Segmentation [8.307551496968156]
We present CUTS, an unsupervised deep learning framework for medical image segmentation.
For each image, it produces an embedding map via intra-image contrastive learning and local patch reconstruction.
CUTS yields a series of coarse-to-fine-grained segmentations that highlight features at various granularities.
arXiv Detail & Related papers (2022-09-23T01:09:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.