Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing
- URL: http://arxiv.org/abs/2301.04558v1
- Date: Wed, 11 Jan 2023 16:35:33 GMT
- Title: Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing
- Authors: Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia,
Maximilian Ilse, Daniel C. Castro, Benedikt Boecking, Harshita Sharma, Kenza
Bouzid, Anja Thieme, Anton Schwaighofer, Maria Wetscherek, Matthew P.
Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay
- Abstract summary: Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
- Score: 53.89917396428747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning in vision-language processing exploits semantic
alignment between imaging and text modalities. Prior work in biomedical VLP has
mostly relied on the alignment of single image and report pairs even though
clinical notes commonly refer to prior images. This not only leads to poor
alignment between the modalities but also misses the opportunity to exploit the
rich self-supervision available in the temporal content of the data. In this
work, we explicitly account for prior images and reports when available during
both training and fine-tuning. Our approach, named BioViL-T, uses a
CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
It is designed to handle challenges that arise across time, such as pose
variations and missing input images. The resulting model excels on downstream
tasks both in single- and multi-image setups, achieving state-of-the-art
performance on (I) progression classification, (II) phrase grounding, and (III)
report generation, whilst offering consistent improvements on disease
classification and sentence-similarity tasks. We release a novel multi-modal
temporal benchmark dataset, MS-CXR-T, to quantify the quality of
vision-language representations in terms of temporal semantics. Our
experimental results show the advantages of incorporating prior images and
reports to make the most of the data.
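To make the described setup concrete, below is a minimal, hypothetical sketch of the kind of pipeline the abstract outlines: a CNN-Transformer multi-image encoder that consumes the current study plus an optional prior, trained jointly with a text encoder under a symmetric contrastive objective. All module names (MultiImageEncoder, TextEncoder, contrastive_loss), layer sizes, and the CLIP-style loss are illustrative assumptions, not the BioViL-T implementation.

```python
# Hypothetical sketch only: a CNN-Transformer multi-image encoder (current study
# plus an optional prior) trained jointly with a toy text encoder via a symmetric
# contrastive loss. Names, sizes, and the loss are assumptions, not BioViL-T code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiImageEncoder(nn.Module):
    """Encodes a current image together with an optional prior image."""

    def __init__(self, dim: int = 128):
        super().__init__()
        # Small CNN backbone standing in for the hybrid model's CNN stage.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # 4x4 grid -> 16 tokens per image
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # Learned embedding marking tokens as "current" (0) or "prior" (1).
        self.time_embed = nn.Embedding(2, dim)

    def _tokens(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images)                 # (B, dim, 4, 4)
        return feats.flatten(2).transpose(1, 2)  # (B, 16, dim)

    def forward(self, current, prior=None):
        tokens = self._tokens(current) + self.time_embed.weight[0]
        if prior is not None:  # the prior study may be missing
            prior_tokens = self._tokens(prior) + self.time_embed.weight[1]
            tokens = torch.cat([tokens, prior_tokens], dim=1)
        fused = self.temporal(tokens)            # attention across both time points
        return F.normalize(fused.mean(dim=1), dim=-1)


class TextEncoder(nn.Module):
    """Toy report encoder; a pretrained language model would be used in practice."""

    def __init__(self, vocab_size: int = 1000, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return F.normalize(hidden.mean(dim=1), dim=-1)


def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss aligning paired image and report embeddings."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    image_enc, text_enc = MultiImageEncoder(), TextEncoder()
    current = torch.randn(2, 1, 64, 64)           # current chest X-rays
    prior = torch.randn(2, 1, 64, 64)             # prior studies (may be None)
    report_ids = torch.randint(0, 1000, (2, 32))  # tokenised reports
    loss = contrastive_loss(image_enc(current, prior), text_enc(report_ids))
    loss.backward()
    print(float(loss))
```

Keeping the prior image optional at the encoder input is what lets the same model serve both single- and multi-image downstream tasks, e.g. phrase grounding on one study or progression classification across two.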
Related papers
- PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation [51.509573838103854]
We propose a semi-supervised learning framework, termed Progressive Mean Teachers (PMT), for medical image segmentation.
Our PMT generates high-fidelity pseudo labels by learning robust and diverse features in the training process.
Experimental results on two datasets with different modalities, i.e., CT and MRI, demonstrate that our method outperforms the state-of-the-art medical image segmentation approaches.
arXiv Detail & Related papers (2024-09-08T15:02:25Z)
- Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective based on forward mapping classification (FMC) and reverse mapping regression (RMR).
arXiv Detail & Related papers (2024-05-30T03:15:09Z)
- MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose a Medical Language-Image Pre-training framework, which exploits the limited image-text medical data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
arXiv Detail & Related papers (2024-01-03T07:54:13Z)
- Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [0.8878802873945023]
This work presents the first systematic study of transferring vision-language segmentation models (VLSMs) to 2D medical images.
Although VLSMs show competitive performance compared to image-only models for segmentation, not all VLSMs utilize the additional information from language prompts.
arXiv Detail & Related papers (2023-08-15T11:28:21Z)
- Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [3.218449686637963]
We propose a unified Image-Text-Label contrastive learning framework based on continuous prompts.
Extensive experiments demonstrate that the unified medical contrastive learning framework performs strongly on several downstream tasks.
arXiv Detail & Related papers (2023-07-12T05:19:10Z)
- Multiscale Progressive Text Prompt Network for Medical Image Segmentation [10.121625177837931]
We propose using progressive text prompts as prior knowledge to guide the segmentation process.
Our model achieves high-quality results with low data annotation costs.
arXiv Detail & Related papers (2023-06-30T23:37:16Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has recently attracted increasing attention, and the massive data it relies on are usually collected in a streaming fashion.
We propose multi-modal knowledge distillation between images and texts to align the instance-wise predictions of old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)