Bi-VLGM : Bi-Level Class-Severity-Aware Vision-Language Graph Matching
for Text Guided Medical Image Segmentation
- URL: http://arxiv.org/abs/2305.12231v1
- Date: Sat, 20 May 2023 16:50:45 GMT
- Title: Bi-VLGM : Bi-Level Class-Severity-Aware Vision-Language Graph Matching
for Text Guided Medical Image Segmentation
- Authors: Wenting Chen, Jie Liu, and Yixuan Yuan
- Abstract summary: We introduce a Bi-level class-severity-aware Vision-Language Graph Matching (Bi-VLGM) for text guided medical image segmentation.
By exploiting the relation between the local (global) and class (severity) features, the segmentation model can selectively learn the class-aware and severity-aware information.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical reports with substantial information can be naturally complementary
to medical images for computer vision tasks, and the modality gap between
vision and language can be bridged by vision-language matching (VLM). However,
current vision-language models distort the intra-modal relations and mainly
include class information in prompt learning, which is insufficient for the
segmentation task. In this paper, we introduce a Bi-level class-severity-aware
Vision-Language Graph Matching (Bi-VLGM) for text guided medical image
segmentation, composed of a word-level VLGM module and a sentence-level VLGM
module, to exploit the class-severity-aware relation among visual-textual
features. In word-level VLGM, to mitigate the distorted intra-modal relation
during VLM, we reformulate VLM as a graph matching problem and introduce a
vision-language graph matching (VLGM) to exploit the high-order relation among
visual-textual features. Then, we perform VLGM between the local features for
each class region and class-aware prompts to bridge their gap. In
sentence-level VLGM, to provide disease severity information for segmentation
task, we introduce severity-aware prompting to quantify the severity level of
retinal lesions, and perform VLGM between the global features and the
severity-aware prompts. By exploiting the relation between the local (global)
and class (severity) features, the segmentation model can selectively learn the
class-aware and severity-aware information to improve performance. Extensive
experiments prove the effectiveness of our method and its superiority to
existing methods. Source code is to be released.
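The bi-level matching can be pictured with a short sketch. The snippet below is a minimal illustration of vision-language graph matching under simplifying assumptions (cosine-similarity edges, equal numbers of visual and textual nodes in matched order, a soft-assignment node term plus an edge-consistency term); all function names and dimensions are illustrative and are not taken from the paper's released code.

import numpy as np


def build_graph(features):
    """Return (nodes, edges): L2-normalised node features and a
    cosine-similarity adjacency capturing intra-modal relations."""
    nodes = features / np.linalg.norm(features, axis=1, keepdims=True)
    edges = nodes @ nodes.T  # pairwise relations between nodes of one modality
    return nodes, edges


def vlgm_loss(visual_feats, text_feats):
    """Graph-match visual features against textual prompt embeddings.

    Node term: soft assignment of each visual node to its textual node
    (matched order is assumed for simplicity).
    Edge term: keeps the two intra-modal relation matrices consistent,
    i.e. high-order relations are aligned rather than only node pairs.
    """
    v_nodes, v_edges = build_graph(visual_feats)
    t_nodes, t_edges = build_graph(text_feats)

    sim = v_nodes @ t_nodes.T                      # cross-modal node similarity
    assign = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    node_term = -np.log(np.diag(assign) + 1e-8).mean()

    edge_term = np.mean((v_edges - t_edges) ** 2)  # relation (edge) consistency
    return float(node_term + edge_term)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes, n_levels, dim = 4, 3, 768

    # Word-level: per-class local features vs. class-aware prompt embeddings.
    local_feats = rng.normal(size=(n_classes, dim))
    class_prompts = rng.normal(size=(n_classes, dim))
    print("word-level VLGM loss:", vlgm_loss(local_feats, class_prompts))

    # Sentence-level: global features vs. severity-aware prompt embeddings.
    global_feats = rng.normal(size=(n_levels, dim))
    severity_prompts = rng.normal(size=(n_levels, dim))
    print("sentence-level VLGM loss:", vlgm_loss(global_feats, severity_prompts))

In this reading, the word-level graphs are built from per-class local features and class-aware prompts, the sentence-level graphs from global features and severity-aware prompts, and the edge-consistency term is what preserves the intra-modal relations that plain vision-language matching tends to distort.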
Related papers
- GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification [4.922864692096282]
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification.
Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge.
We propose a vision-language MIL framework with two key contributions.
arXiv Detail & Related papers (2025-08-02T09:59:39Z) - A Vision-Language Model for Focal Liver Lesion Classification [0.0]
Vision-Language models (VLMs) such as the Contrastive Language-Image Pre-training model (CLIP) have been applied to image classification.
We propose Liver-VLM, a model specifically designed for focal liver lesion (FLL) classification.
arXiv Detail & Related papers (2025-05-06T09:19:12Z) - From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation [46.99748372216857]
Vision-language models (VLMs) provide semantic context through textual descriptions but lack the required explanation precision.
We propose a teacher-student framework that integrates both gaze and language supervision, leveraging their complementary strengths.
Our method achieves Dice scores of 80.78%, 80.53%, and 84.22% on three datasets, improving 3-5% over gaze baselines without increasing the annotation burden.
arXiv Detail & Related papers (2025-04-15T16:32:15Z) - BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation [9.262045402495225]
BiPVL-Seg is an end-to-end framework that integrates vision-language fusion and embedding alignment.
BiPVL-Seg introduces progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders.
It incorporates global-local contrastive alignment, a training objective that enhances the text encoder's comprehension by aligning text and vision embeddings at both class and concept levels.
arXiv Detail & Related papers (2025-03-30T17:34:39Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models [10.62320998365966]
Vision Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection.
We propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs' image-level understanding without the need for manual annotations.
Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations.
arXiv Detail & Related papers (2024-10-21T05:51:51Z) - LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLAVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - MLIP: Enhancing Medical Visual Representation with Divergence Encoder
and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z) - Enhancing medical vision-language contrastive learning via
inter-matching relation modelling [14.777259981193726]
Medical image representations can be learned through medical vision-language contrastive learning (mVLCL).
Recent mVLCL methods attempt to align image sub-regions and the report keywords as local-matchings.
We propose a mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF).
arXiv Detail & Related papers (2024-01-19T05:28:51Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language
Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
arXiv Detail & Related papers (2023-11-27T19:00:06Z) - LION : Empowering Multimodal Large Language Model with Dual-Level Visual
Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This work presents the first systematic study of transferring Vision-Language Models to 2D medical images.
Although vision-language segmentation models (VLSMs) show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)