Visual concept ranking uncovers medical shortcuts used by large multimodal models
- URL: http://arxiv.org/abs/2602.05096v1
- Date: Wed, 04 Feb 2026 22:27:34 GMT
- Title: Visual concept ranking uncovers medical shortcuts used by large multimodal models
- Authors: Joseph D. Janizek, Sonnet Xu, Junayd Lateef, Roxana Daneshjou
- Abstract summary: We introduce a method for identifying important visual concepts within large multimodal models (LMMs). We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images.
- Score: 1.1082922912570348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
Related papers
- Location-Aware Pretraining for Medical Difference Visual Question Answering [14.75114843903826]
We introduce a pretraining framework that incorporates location-aware tasks. These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA.
arXiv Detail & Related papers (2026-03-05T08:44:06Z)
- Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z)
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Cross-model Mutual Learning for Exemplar-based Medical Image Segmentation [25.874281336821685]
We introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS).
arXiv Detail & Related papers (2024-04-18T00:18:07Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
- Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning [26.425784890859738]
MaCo is a masked contrastive chest X-ray foundation model.
It simultaneously achieves fine-grained image understanding and zero-shot learning for a variety of medical imaging tasks.
It is shown to outperform 10 state-of-the-art approaches across tasks such as classification, segmentation, detection, and phrase grounding.
arXiv Detail & Related papers (2023-09-12T01:29:37Z)
- A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics [63.106382317917344]
We report a Transformer-based representation-learning model as a clinical diagnostic aid that processes multimodal input in a unified manner.
The unified model outperformed an image-only model and non-unified multimodal diagnosis models in the identification of pulmonary diseases.
arXiv Detail & Related papers (2023-06-01T16:23:47Z)
- Ambiguous Medical Image Segmentation using Diffusion Models [60.378180265885945]
We introduce a single diffusion model-based approach that produces multiple plausible outputs by learning a distribution over group insights.
Our proposed model generates a distribution of segmentation masks by leveraging the inherent sampling process of diffusion.
Comprehensive results show that our proposed approach outperforms existing state-of-the-art ambiguous segmentation networks.
arXiv Detail & Related papers (2023-04-10T17:58:22Z)
- Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision [12.306688233127312]
We introduce a human-in-the-loop framework in the model training process.
Our method can automatically discover confounding factors.
It is capable of learning confounding concepts using easily obtained concept exemplars.
arXiv Detail & Related papers (2023-03-02T01:02:18Z)
- TorchEsegeta: Framework for Interpretability and Explainability of Image-based Deep Learning Models [0.0]
Clinicians are often sceptical about applying automatic image processing approaches, especially deep learning based methods, in practice.
This paper presents approaches that help to interpret and explain the results of deep learning algorithms by depicting the anatomical areas which influence the decision of the algorithm most.
This research presents a unified framework, TorchEsegeta, for applying various interpretability and explainability techniques to deep learning models.
arXiv Detail & Related papers (2021-10-16T01:00:15Z)
- A Question-Centric Model for Visual Question Answering in Medical Imaging [3.619444603816032]
We present a novel Visual Question Answering approach that allows an image to be queried by means of a written question.
Experiments on a variety of medical and natural image datasets show that by fusing image and question features in a novel way, the proposed approach achieves an equal or higher accuracy compared to current methods.
arXiv Detail & Related papers (2020-03-02T10:16:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.