Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
- URL: http://arxiv.org/abs/2510.25164v2
- Date: Fri, 31 Oct 2025 01:57:17 GMT
- Title: Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning
- Authors: Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde
- Abstract summary: We present a transformer-based framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
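As a rough illustration of the alignment objective described in the abstract, here is a minimal PyTorch sketch of a hybrid cosine-MSE loss over image and caption embeddings, plus similarity-based caption retrieval at inference. The equal weighting `alpha` and the argmax retrieval step are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_cosine_mse_loss(img_emb, txt_emb, alpha=0.5):
    # Align L2-normalized image and caption embeddings with a
    # combination of (1 - cosine similarity) and mean squared error.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    cos_term = 1.0 - F.cosine_similarity(img, txt, dim=-1).mean()
    mse_term = F.mse_loss(img, txt)
    return alpha * cos_term + (1.0 - alpha) * mse_term

def retrieve_caption(img_emb, caption_bank):
    # Contrastive inference: score every candidate caption embedding by
    # cosine similarity and return the index of the best match.
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(caption_bank, dim=-1).T
    return sims.argmax(dim=-1)
```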
Related papers
- DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations [0.7201894411169433]
We propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance.
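A minimal sketch, under our own assumptions, of the content/style split this summary describes: a trainable anatomy (content) encoder, a small stand-in for the frozen pre-trained CLIP contrast (style) encoder, and FiLM-style modulation before decoding. Nothing here reproduces DIST-CLIP's actual architecture.

```python
import torch
import torch.nn as nn

class DisentangledHarmonizer(nn.Module):
    def __init__(self, ch=32, style_dim=64):
        super().__init__()
        # Trainable content branch: keeps anatomical structure.
        self.anatomy_enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        # Style branch: in the real method this would be a frozen
        # pre-trained CLIP image encoder producing a contrast vector.
        self.contrast_enc = nn.Sequential(
            nn.Conv2d(1, style_dim, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.film = nn.Linear(style_dim, 2 * ch)  # style -> scale/shift
        self.decoder = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, src_img, ref_img):
        content = self.anatomy_enc(src_img)        # anatomy from source
        style = self.contrast_enc(ref_img)         # contrast from reference
        scale, shift = self.film(style).chunk(2, dim=1)
        content = content * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decoder(content)               # harmonized image
```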
arXiv Detail & Related papers (2025-12-08T16:09:10Z)
- RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities. We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans. We develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
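One hedged reading of "reports define image-wise similarity ordering" is to distill report-report similarities into image-image similarities as soft retrieval targets. The KL formulation and temperature below are our assumptions, not the paper's stated objective.

```python
import torch
import torch.nn.functional as F

def report_guided_ranking_loss(img_embs, report_embs, temp=0.07):
    # Use report-report cosine similarity as soft targets for
    # image-image similarity, so dense reports define the ordering.
    img = F.normalize(img_embs, dim=-1)
    rep = F.normalize(report_embs, dim=-1)
    img_sim = (img @ img.T) / temp                     # predicted ordering
    target = F.softmax((rep @ rep.T) / temp, dim=-1)   # report-derived ordering
    return F.kl_div(F.log_softmax(img_sim, dim=-1), target,
                    reduction="batchmean")
```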
arXiv Detail & Related papers (2025-03-06T17:43:03Z)
- Rethinking Perceptual Metrics for Medical Image Translation [11.930968669340864]
We show that perceptual metrics do not correlate with segmentation metrics because they extend poorly to the anatomical constraints of this subfield.
We find that the lesser-used pixel-level SWD metric may be useful for subtle intra-modality translation.
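For concreteness, a minimal NumPy sketch of a sliced Wasserstein distance between two equally sized sets of vectors (e.g., flattened pixel patches). How the paper extracts patches and sets the projection count is not given here, so those choices are assumptions.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    # x, y: (N, D) point sets with the same N; project onto random unit
    # directions and compare sorted 1D projections (1D Wasserstein-1).
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    px = np.sort(x @ theta.T, axis=0)   # (N, n_proj) sorted projections
    py = np.sort(y @ theta.T, axis=0)
    return float(np.mean(np.abs(px - py)))
```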
arXiv Detail & Related papers (2024-04-10T19:39:43Z)
- ContourDiff: Unpaired Image-to-Image Translation with Structural Consistency for Medical Imaging [14.487188068402178]
We introduce a novel metric to quantify the structural bias between domains which must be considered for proper translation.
We then propose ContourDiff, a novel image-to-image translation algorithm that leverages domain-invariant anatomical contour representations.
We evaluate our method on challenging lumbar spine and hip-and-thigh CT-to-MRI translation tasks.
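A minimal sketch of the kind of domain-invariant contour condition this summary describes, using a Canny edge map as the structural representation. The edge detector and its parameters are our assumptions, not necessarily ContourDiff's choice.

```python
import numpy as np
from skimage import feature

def contour_condition(img, sigma=2.0):
    # Normalize a 2D slice to [0, 1], then extract a binary anatomical
    # contour map to condition the translation model on structure only.
    img = (img - img.min()) / (np.ptp(img) + 1e-8)
    return feature.canny(img, sigma=sigma).astype(np.float32)
```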
arXiv Detail & Related papers (2024-03-16T03:33:52Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
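A hedged sketch of two-level (global plus token-level) alignment between two-stream Transformer outputs. HAT's actual hierarchical module is more elaborate; the mean pooling, temperature, and max-matching local term below are our assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchical_alignment(img_tokens, txt_tokens, temp=0.07):
    # Global term: contrastive loss on pooled embeddings, (B, N/M, D) in.
    g_img = F.normalize(img_tokens.mean(dim=1), dim=-1)
    g_txt = F.normalize(txt_tokens.mean(dim=1), dim=-1)
    logits = g_img @ g_txt.T / temp
    labels = torch.arange(logits.size(0), device=logits.device)
    global_loss = F.cross_entropy(logits, labels)
    # Local term: each text token matched to its best image patch.
    local = torch.einsum("bnd,bmd->bnm",
                         F.normalize(txt_tokens, dim=-1),
                         F.normalize(img_tokens, dim=-1)).max(dim=-1).values
    return global_loss + (1.0 - local.mean())
```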
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
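A minimal sketch, under stated assumptions, of the "disruption" step this summary describes: mask random local 3D patches and add low-level noise, then train the autoencoder to reconstruct the clean volume. Patch size, mask ratio, and noise scale are assumptions.

```python
import torch

def disrupt(volume, mask_ratio=0.5, patch=8, noise_std=0.1):
    # volume: (B, C, D, H, W) with D, H, W divisible by `patch`.
    b, c, d, h, w = volume.shape
    grid = (d // patch, h // patch, w // patch)
    keep = torch.rand(b, *grid, device=volume.device) > mask_ratio
    mask = keep.repeat_interleave(patch, 1)\
               .repeat_interleave(patch, 2)\
               .repeat_interleave(patch, 3)
    disrupted = volume * mask.unsqueeze(1)                 # local masking
    disrupted = disrupted + noise_std * torch.randn_like(disrupted)
    return disrupted    # reconstruction target remains the original volume
```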
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Few-shot Medical Image Segmentation via Cross-Reference Transformer [3.2634122554914]
Few-shot segmentation (FSS) has the potential to address these challenges by learning new categories from a small number of labeled samples.
We propose a novel self-supervised few-shot medical image segmentation network with a Cross-Reference Transformer.
Experimental results show that the proposed model achieves good results on both CT and MRI datasets.
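The summary does not spell out the mechanism; one common reading of "cross-reference" is bidirectional cross-attention between query and support features, sketched below with standard PyTorch attention. Module names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossReference(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.q2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s2q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats, support_feats):
        # Query features attend to support features, and vice versa,
        # so each branch is refined by reference to the other.
        q, _ = self.q2s(query_feats, support_feats, support_feats)
        s, _ = self.s2q(support_feats, query_feats, query_feats)
        return q, s
```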
arXiv Detail & Related papers (2023-04-19T13:05:18Z)
- BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP for Generic Natural Visual Stimulus Decoding [51.911473457195555]
BrainCLIP is a task-agnostic fMRI-based brain decoding model.
It bridges the modality gap between brain activity, images, and text.
BrainCLIP can reconstruct visual stimuli with high semantic fidelity.
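A minimal sketch of the general recipe the summary implies: map fMRI activity into a CLIP-style joint embedding space and decode by nearest-neighbour retrieval. The voxel count, MLP mapper, and retrieval step are assumptions, not BrainCLIP's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FMRIToCLIP(nn.Module):
    def __init__(self, n_voxels=15000, clip_dim=512):
        super().__init__()
        # Small MLP regressing voxel activity into the joint space.
        self.mapper = nn.Sequential(
            nn.Linear(n_voxels, 2048), nn.GELU(),
            nn.Linear(2048, clip_dim))

    def forward(self, fmri):
        return F.normalize(self.mapper(fmri), dim=-1)

def decode(brain_emb, candidate_embs):
    # Retrieve the candidate image/text embedding closest to the
    # predicted brain embedding.
    sims = brain_emb @ F.normalize(candidate_embs, dim=-1).T
    return sims.argmax(dim=-1)
```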
arXiv Detail & Related papers (2023-02-25T03:28:54Z)
- Vision Transformers in Medical Imaging: A Review [0.0]
Transformers, models built on an attention-based encoder-decoder architecture, have gained prevalence in the field of natural language processing (NLP).
In this paper, we attempt to provide a comprehensive and up-to-date review of the application of transformers in medical imaging, describing the transformer model and comparing it with a variety of convolutional neural networks (CNNs).
arXiv Detail & Related papers (2022-11-18T05:52:37Z)
- Attentive Symmetric Autoencoder for Brain MRI Segmentation [56.02577247523737]
We propose a novel Attentive Symmetric Auto-encoder based on Vision Transformer (ViT) for 3D brain MRI segmentation tasks.
In the pre-training stage, the proposed auto-encoder pays more attention to reconstructing informative patches, selected according to gradient metrics.
Experimental results show that our proposed attentive symmetric auto-encoder outperforms the state-of-the-art self-supervised learning methods and medical image segmentation models.
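A minimal sketch, under our own assumptions, of how "paying more attention to informative patches according to gradient metrics" can be realized: weight each patch's reconstruction error by the target's mean gradient magnitude (2D case for brevity; the paper works with 3D MRI).

```python
import torch
import torch.nn.functional as F

def gradient_weighted_recon_loss(pred, target, patch=16):
    # pred, target: (B, C, H, W); finite differences approximate the
    # gradient magnitude used to score patch informativeness.
    gx = target[..., :, 1:] - target[..., :, :-1]
    gy = target[..., 1:, :] - target[..., :-1, :]
    grad = F.pad(gx.abs(), (0, 1)) + F.pad(gy.abs(), (0, 0, 0, 1))
    w = F.avg_pool2d(grad, patch)                    # per-patch informativeness
    w = w / (w.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    err = F.avg_pool2d((pred - target) ** 2, patch)  # per-patch MSE
    return (w * err).sum(dim=(-2, -1)).mean()
```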
arXiv Detail & Related papers (2022-09-19T09:43:19Z)
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
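A simplified, hedged sketch of the gating idea: attention run along one spatial axis at a time, with a single learned gate scaling its contribution. The paper's gates instead act on the learned relative-position terms inside axial attention, so this is an approximation of the control mechanism, not the exact model.

```python
import torch
import torch.nn as nn

class GatedAxialAttention(nn.Module):
    def __init__(self, dim, heads=4, axis="h"):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learned control, starts closed
        self.axis = axis

    def forward(self, x):                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        if self.axis == "h":              # attend along H, batch over W
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:                             # attend along W, batch over H
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        seq = seq + torch.tanh(self.gate) * out   # gated residual update
        if self.axis == "h":
            return seq.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return seq.reshape(b, h, w, c).permute(0, 3, 1, 2)
```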
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
- TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [78.01570371790669]
Medical image segmentation is an essential prerequisite for developing healthcare systems.
On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de facto standard.
We propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation.
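For orientation, a toy version of the hybrid design this summary describes: CNN features are tokenized for a Transformer encoder, then reshaped and upsampled to per-pixel logits. This omits the skip connections and scale of the real TransUNet.

```python
import torch
import torch.nn as nn

class MiniTransUNet(nn.Module):
    def __init__(self, ch=64, depth=2, heads=4):
        super().__init__()
        # CNN stem: downsample 4x and extract local features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        layer = nn.TransformerEncoderLayer(ch, heads, dim_feedforward=4 * ch,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Upsampling head standing in for the U-Net decoder (no skips here).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 2, stride=2))

    def forward(self, x):                        # x: (B, 1, H, W), H, W % 4 == 0
        f = self.cnn(x)                          # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)    # (B, H*W/16, C) patch tokens
        tokens = self.encoder(tokens)            # global context via attention
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(f)                   # (B, 1, H, W) segmentation logits
```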
arXiv Detail & Related papers (2021-02-08T16:10:50Z)