GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
- URL: http://arxiv.org/abs/2507.07006v1
- Date: Wed, 09 Jul 2025 16:35:21 GMT
- Title: GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
- Authors: S M Taslim Uddin Raju, Md. Milon Islam, Md Rezwanul Haque, Hamdi Altaheri, Fakhri Karray
- Abstract summary: Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. We introduce a novel GNN-ViTCap framework for classification and caption generation from histopathology images. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning.
- Score: 1.25828876338076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions, since pathologists capture the images subjectively. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce a novel GNN-ViTCap framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
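The pipeline described in the abstract (cluster patch embeddings, select representatives via scalar dot attention, build a nearest-neighbor graph from the similarity matrix, aggregate with a GNN, and project into the language model's input space) can be sketched as follows. This is a minimal illustration under assumed shapes: the k-means-style assignment stands in for deep embedded clustering, and every name and dimension here (`select_patches`, `knn_graph`, `gnn_aggregate`, embedding size 384) is hypothetical, not the authors' code.

```python
# Minimal sketch of the GNN-ViTCap aggregation pipeline from the abstract.
# All shapes and helper names are illustrative assumptions.
import torch
import torch.nn.functional as F

def select_patches(patch_emb: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """Remove redundancy: cluster patch embeddings, then form one
    representative per cluster via scalar dot-product attention."""
    # k-means-style assignment against sampled centroids; the paper
    # instead learns the clustering with deep embedded clustering.
    centroids = patch_emb[torch.randperm(patch_emb.size(0))[:num_clusters]]
    assign = torch.cdist(patch_emb, centroids).argmin(dim=1)
    reps = []
    for c in range(num_clusters):
        members = patch_emb[assign == c]
        if members.numel() == 0:
            continue
        # Scalar dot attention: score members against the cluster mean.
        weights = (members @ members.mean(dim=0)).softmax(dim=0)
        reps.append((weights.unsqueeze(1) * members).sum(dim=0))
    return torch.stack(reps)                               # (K, D)

def knn_graph(x: torch.Tensor, k: int) -> torch.Tensor:
    """Adjacency from the cosine-similarity matrix: connect each node
    to its k most similar neighbors."""
    xn = F.normalize(x, dim=1)
    sim = xn @ xn.T
    idx = sim.topk(k + 1, dim=1).indices[:, 1:]            # drop self-match
    adj = torch.zeros_like(sim)
    adj.scatter_(1, idx, 1.0)
    return ((adj + adj.T) > 0).float()                     # symmetrize

def gnn_aggregate(x, adj, w):
    """One mean-aggregation message-passing layer, then global pooling."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    h = F.relu(((adj @ x) / deg) @ w + x @ w)              # neighbors + self
    return h.mean(dim=0)                                   # slide embedding

# Toy usage: 200 patch embeddings of assumed dim 384 (e.g. a ViT backbone).
patches = torch.randn(200, 384)
reps = select_patches(patches, num_clusters=32)
adj = knn_graph(reps, k=5)
w = torch.randn(384, 384) * 0.02
slide_emb = gnn_aggregate(reps, adj, w)
# A linear layer then projects the aggregated embedding into the language
# model's input space (768 is an assumed LLM hidden size).
proj = torch.nn.Linear(384, 768)
print(proj(slide_emb).shape)                               # torch.Size([768])
```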
Related papers
- From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis [81.19923502845441]
We develop a graph-based framework that constructs WSI graph representations. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches. In our method's final step, we solve the diagnostic task through a graph attention network.
arXiv Detail & Related papers (2025-03-14T20:15:04Z) - PathAlign: A vision-language model for whole slide images in histopathology [13.567674461880905]
We develop a vision-language model based on the BLIP-2 framework using WSIs and curated text from pathology reports.
This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest.
We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization.
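Retrieval in a shared image-text embedding space, as mentioned above, typically reduces to a nearest-neighbor search over normalized embeddings. A minimal sketch, with random placeholder tensors standing in for PathAlign-style WSI and text embeddings (all dimensions assumed):

```python
# Hedged sketch of case retrieval in a shared image-text embedding space.
# The random tensors stand in for real WSI/text embeddings; the encoders
# themselves are out of scope here.
import torch
import torch.nn.functional as F

wsi_emb = F.normalize(torch.randn(1000, 512), dim=1)  # 1000 slides, assumed dim 512
query_emb = F.normalize(torch.randn(512), dim=0)      # embedded text query

scores = wsi_emb @ query_emb                          # cosine similarities
top = scores.topk(5)                                  # 5 most similar slides
print(top.indices.tolist())
```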
arXiv Detail & Related papers (2024-06-27T23:43:36Z) - A self-supervised framework for learning whole slide representations [52.774822784847565]
We present Slide Pre-trained Transformers (SPT) for gigapixel-scale self-supervision of whole slide images.
We benchmark SPT visual representations on five diagnostic tasks across three biomedical microscopy datasets.
arXiv Detail & Related papers (2024-02-09T05:05:28Z) - Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT [1.0819408603463427]
We show that, using an existing pre-trained Vision Transformer (ViT) to encode 4096x4096 patches of the Whole Slide Image (WSI) and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for report generation, we can build a performant and portable report generation mechanism.
Our method not only generates and evaluates captions that describe the image, but also classifies the image by tissue type and patient gender.
arXiv Detail & Related papers (2023-12-03T15:56:09Z) - WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images [5.960501267687475]
We investigate how to generate pathology reports given whole slide images.
We curated the largest WSI-text dataset (PathText).
On the model end, we propose the multiple instance generative model (MI-Gen).
arXiv Detail & Related papers (2023-11-27T05:05:41Z) - AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, achieving the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z) - Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics [63.76637479503006]
Learning good representations of giga-pixel whole slide pathology images (WSIs) for downstream tasks is critical.
This paper proposes a hierarchical-based multimodal transformer framework that learns a hierarchical mapping between pathology images and corresponding genes.
Our architecture requires fewer GPU resources compared with benchmark methods while maintaining better WSI representation ability.
arXiv Detail & Related papers (2022-11-29T23:47:56Z) - Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes.
Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z) - Weakly-supervised Generative Adversarial Networks for medical image classification [1.479639149658596]
We propose a novel medical image classification algorithm called Weakly-Supervised Generative Adversarial Networks (WSGAN).
WSGAN uses only a small number of unlabeled real images to generate fake images or mask images that enlarge the training set.
We show that WSGAN can obtain relatively high learning performance using only a small amount of labeled and unlabeled data.
arXiv Detail & Related papers (2021-11-29T15:38:48Z) - Graph Neural Networks for Unsupervised Domain Adaptation of Histopathological Image Analytics [22.04114134677181]
We present a novel method for unsupervised domain adaptation in histological image analysis.
It is based on a backbone for embedding images into a feature space, and a graph neural layer for propagating the supervision signals of labeled images.
In experiments, our method achieves state-of-the-art performance on four public datasets.
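Propagating supervision signals from labeled images through a graph can be illustrated with generic label propagation over a k-nearest-neighbor graph of image features; this is a simplified stand-in for the paper's graph neural layer, with the propagation rule, sizes, and class count all assumed:

```python
# Generic label-propagation sketch over a kNN graph of image features;
# a stand-in for a learned graph neural layer, not the paper's method.
import torch
import torch.nn.functional as F

feats = F.normalize(torch.randn(100, 256), dim=1)     # 100 images, assumed dim 256
labels = torch.full((100,), -1)                       # -1 marks unlabeled images
labels[:10] = torch.randint(0, 3, (10,))              # 10 labeled, 3 classes

sim = feats @ feats.T
adj = torch.zeros_like(sim)
adj.scatter_(1, sim.topk(6, dim=1).indices[:, 1:], 1.0)  # 5-NN graph, no self-loops

y = torch.zeros(100, 3)
y[labels >= 0] = F.one_hot(labels[labels >= 0], 3).float()
for _ in range(10):                                   # iterate toward a stable labeling
    y = (adj @ y) / adj.sum(dim=1, keepdim=True).clamp(min=1)
    y[labels >= 0] = F.one_hot(labels[labels >= 0], 3).float()  # re-clamp labeled nodes
pred = y.argmax(dim=1)                                # propagated class for each image
```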
arXiv Detail & Related papers (2020-08-21T04:53:44Z) - High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state-of-the-art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)