Bi-VLGM : Bi-Level Class-Severity-Aware Vision-Language Graph Matching
for Text Guided Medical Image Segmentation
- URL: http://arxiv.org/abs/2305.12231v1
- Date: Sat, 20 May 2023 16:50:45 GMT
- Title: Bi-VLGM : Bi-Level Class-Severity-Aware Vision-Language Graph Matching
for Text Guided Medical Image Segmentation
- Authors: Chen Wenting, Liu Jie and Yuan Yixuan
- Abstract summary: We introduce a Bi-level class-severity-aware Vision-Language Graph Matching (Bi-VLGM) for text guided medical image segmentation.
By exploiting the relation between the local (global) and class (severity) features, the segmentation model can selectively learn the class-aware and severity-aware information.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical reports with substantial information can be naturally complementary
to medical images for computer vision tasks, and the modality gap between
vision and language can be bridged by vision-language matching (VLM). However,
current vision-language models distort the intra-modal relation and mainly
include class information in prompt learning, which is insufficient for
segmentation tasks. In this paper, we introduce a Bi-level class-severity-aware
Vision-Language Graph Matching (Bi-VLGM) for text guided medical image
segmentation, composed of a word-level VLGM module and a sentence-level VLGM
module, to exploit the class-severity-aware relation among visual-textual
features. In word-level VLGM, to mitigate the distorted intra-modal relation
during VLM, we reformulate VLM as a graph matching problem and introduce
vision-language graph matching (VLGM) to exploit the high-order relation among
visual-textual features. Then, we perform VLGM between the local features for
each class region and class-aware prompts to bridge their gap. In
sentence-level VLGM, to provide disease severity information for the segmentation
task, we introduce severity-aware prompting to quantify the severity level of
retinal lesions, and perform VLGM between the global features and the
severity-aware prompts. By exploiting the relation between the local (global)
and class (severity) features, the segmentation model can selectively learn the
class-aware and severity-aware information to promote performance. Extensive
experiments prove the effectiveness of our method and its superiority to
existing methods. Source code is to be released.
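The abstract frames vision-language matching as graph matching between visual and textual features. Since the authors' source code is not yet released, the following is only a minimal sketch of that general idea and not the Bi-VLGM implementation. It assumes each modality supplies the same number of node features, uses cosine similarity for both node and edge affinities, and swaps in plain Sinkhorn normalization to obtain a soft assignment. All function names (cosine, adjacency, sinkhorn, match_graphs) and the parameters n_iters and tau are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def adjacency(nodes):
    """Edge weights of a fully connected graph: pairwise cosine similarity."""
    return [[cosine(a, b) for b in nodes] for a in nodes]


def sinkhorn(scores, n_iters=50, tau=0.1):
    """Turn a square score matrix into an approximately doubly stochastic
    soft assignment by alternating row and column normalization."""
    n = len(scores)
    m = [[math.exp(s / tau) for s in row] for row in scores]
    for _ in range(n_iters):
        row_sums = [sum(row) for row in m]
        m = [[x / row_sums[i] for x in row] for i, row in enumerate(m)]
        col_sums = [sum(row[j] for row in m) for j in range(n)]
        m = [[x / col_sums[j] for j, x in enumerate(row)] for row in m]
    return m


def match_graphs(visual_nodes, text_nodes):
    """Softly match visual nodes to text nodes (assumed equal in number).

    Returns the soft assignment X and a second-order consistency term,
    the mean squared difference between A_v and X A_t X^T, which is small
    when matched nodes also have similar intra-modal relations.
    """
    n = len(visual_nodes)
    node_scores = [[cosine(v, t) for t in text_nodes] for v in visual_nodes]
    assign = sinkhorn(node_scores)
    a_v = adjacency(visual_nodes)
    a_t = adjacency(text_nodes)
    edge_loss = 0.0
    for i in range(n):
        for j in range(n):
            projected = sum(
                assign[i][k] * a_t[k][l] * assign[j][l]
                for k in range(n)
                for l in range(n)
            )
            edge_loss += (a_v[i][j] - projected) ** 2
    return assign, edge_loss / (n * n)


# Toy example: three visual region features and three prompt features.
visual = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text = [[0.0, 0.9, 0.2], [0.1, 0.1, 1.0], [1.0, 0.0, 0.1]]
assignment, consistency = match_graphs(visual, text)
```

In this sketch the node term aligns individual features across modalities, while the edge term penalizes assignments that change the pairwise relations inside each modality, which is one way to read the abstract's concern about distorted intra-modal relations.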
Related papers
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z)
- Enhancing medical vision-language contrastive learning via inter-matching relation modelling [14.777259981193726]
Medical image representations can be learned through medical vision-language contrastive learning (mVLCL).
Recent mVLCL methods attempt to align image sub-regions and the report keywords as local-matchings.
We propose an mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF).
arXiv Detail & Related papers (2024-01-19T05:28:51Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of instruction-tuned large vision-language models (IT-LVLMs) on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z)
- Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare [14.646414629627001]
This study introduces Qilin-Med-VL, the first Chinese large vision-language model designed to integrate the analysis of textual and visual data.
We also release ChiMed-VL, a dataset consisting of more than 1M image-text pairs.
arXiv Detail & Related papers (2023-10-27T08:05:21Z)
- Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This work presents the first systematic study on transferring Vision-Language Models to 2D medical images.
Although vision-language segmentation models (VLSMs) show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)