Medical Phrase Grounding with Region-Phrase Context Contrastive
Alignment
- URL: http://arxiv.org/abs/2303.07618v1
- Date: Tue, 14 Mar 2023 03:57:16 GMT
- Title: Medical Phrase Grounding with Region-Phrase Context Contrastive
Alignment
- Authors: Zhihao Chen, Yang Zhou, Anh Tran, Junting Zhao, Liang Wan, Gideon Ooi,
Lionel Cheng, Choon Hua Thng, Xinxing Xu, Yong Liu, Huazhu Fu
- Abstract summary: Medical phrase grounding aims to locate the most relevant region in a medical image, given a phrase query describing certain medical findings.
In this paper, we propose MedRPG, an end-to-end approach for MPG.
To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo).
- Score: 35.56193044201645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical phrase grounding (MPG) aims to locate the most relevant region in a
medical image, given a phrase query describing certain medical findings, which
is an important task for medical image analysis and radiological diagnosis.
However, existing visual grounding methods rely on general visual features for
identifying objects in natural images and are not capable of capturing the
subtle and specialized features of medical findings, leading to sub-optimal
performance in MPG. In this paper, we propose MedRPG, an end-to-end approach
for MPG. MedRPG is built on a lightweight vision-language transformer encoder
and directly predicts the box coordinates of mentioned medical findings, which
can be trained with limited medical data, making it a valuable tool in medical
image analysis. To enable MedRPG to locate nuanced medical findings with better
region-phrase correspondences, we further propose Tri-attention Context
contrastive alignment (TaCo). TaCo seeks context alignment to pull both the
features and attention outputs of relevant region-phrase pairs close together
while pushing those of irrelevant regions far away. This ensures that the final
box prediction depends more on its finding-specific regions and phrases.
Experimental results on three MPG datasets demonstrate that our MedRPG
outperforms state-of-the-art visual grounding approaches by a large margin.
Additionally, the proposed TaCo strategy is effective in enhancing finding
localization ability and reducing spurious region-phrase correlations.
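The abstract describes TaCo's core mechanism, pulling the features of matched region-phrase pairs together while pushing irrelevant regions away, without giving the exact loss. As an illustration only, a minimal InfoNCE-style contrastive alignment over region and phrase features could look like the sketch below. The function name, the temperature value, and the use of cosine similarity are assumptions; TaCo additionally aligns attention outputs, which this sketch omits.

```python
import numpy as np

def contrastive_alignment_loss(region_feats, phrase_feats, temperature=0.07):
    """InfoNCE-style sketch (not the paper's exact TaCo objective):
    for each phrase i, region i is the positive (finding-specific)
    region and all other regions in the batch are negatives.
    Rows are L2-normalized so the dot product is cosine similarity."""
    # Normalize each feature vector to unit length.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    p = phrase_feats / np.linalg.norm(phrase_feats, axis=1, keepdims=True)
    # Cosine similarity between every phrase (row) and every region (column).
    logits = p @ r.T / temperature                      # shape: (N, N)
    # Log-softmax over regions; diagonal entries are the matched pairs.
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the correct (diagonal) region per phrase.
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing such a loss concentrates probability mass on the matched region for each phrase, which is one way to realize the abstract's goal of making the final box prediction depend on finding-specific regions and phrases.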
Related papers
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- Content-Based Image Retrieval for Multi-Class Volumetric Radiology Images: A Benchmark Study [0.6249768559720122]
We benchmark embeddings derived from pre-trained supervised models on medical images against embeddings derived from pre-trained unsupervised models on non-medical images.
For volumetric image retrieval, we adopt a late interaction re-ranking method inspired by text matching.
arXiv Detail & Related papers (2024-05-15T13:34:07Z)
- Grounded Knowledge-Enhanced Medical VLP for Chest X-Ray [12.239249676716247]
Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical image and text.
We propose a grounded knowledge-enhanced medical vision-language pre-training framework for chest X-ray.
Our results show the advantage of incorporating grounding mechanism to remove biases and improve the alignment between chest X-ray image and radiology report.
arXiv Detail & Related papers (2024-04-23T05:16:24Z)
- MedRG: Medical Report Grounding with Multi-modal Large Language Model [42.04042642085121]
Medical Report Grounding (MedRG) is an end-to-end solution that utilizes a multi-modal Large Language Model to predict key phrases.
The experimental results validate the effectiveness of MedRG, surpassing the performance of the existing state-of-the-art medical phrase grounding methods.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Cross-Modal Causal Intervention for Medical Report Generation [109.83549148448469]
Medical report generation (MRG) is essential for computer-aided diagnosis and medication guidance.
Due to the spurious correlations within image-text data induced by visual and linguistic biases, it is challenging to generate accurate reports reliably describing lesion areas.
We propose a novel Visual-Linguistic Causal Intervention (VLCI) framework for MRG, which consists of a visual deconfounding module (VDM) and a linguistic deconfounding module (LDM).
arXiv Detail & Related papers (2023-03-16T07:23:55Z)
- TarGAN: Target-Aware Generative Adversarial Networks for Multi-modality Medical Image Translation [4.333115837538408]
We propose a novel target-aware generative adversarial network called TarGAN.
TarGAN is capable of learning multi-modality medical image translation without relying on paired data.
Experiments on both quantitative measures and qualitative evaluations demonstrate that TarGAN outperforms the state-of-the-art methods in all cases.
arXiv Detail & Related papers (2021-05-19T08:45:33Z)
- Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.