Contextualized Keyword Representations for Multi-modal Retinal Image
Captioning
- URL: http://arxiv.org/abs/2104.12471v1
- Date: Mon, 26 Apr 2021 11:08:13 GMT
- Title: Contextualized Keyword Representations for Multi-modal Retinal Image
Captioning
- Authors: Jia-Hong Huang, Ting-Wei Wu, Marcel Worring
- Abstract summary: A traditional medical image captioning model creates a medical description based only on a single medical image input.
A new end-to-end deep multi-modal medical image captioning model is proposed.
- Score: 16.553644007702808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical image captioning automatically generates a medical description of the
content of a given medical image. A traditional medical image captioning model creates
the description based only on a single medical image input, so abstract medical
descriptions or concepts are hard to generate, which limits the effectiveness of medical
image captioning. Multi-modal medical image captioning addresses this problem by adding
textual input, e.g., expert-defined keywords, as one of the main drivers of description
generation. Effectively encoding both the textual input and the medical image is
therefore essential for this task. In this work, a new end-to-end deep multi-modal
medical image captioning model is proposed, built on contextualized keyword
representations, textual feature reinforcement, and masked self-attention. On the
existing multi-modal medical image captioning dataset, the proposed model improves over
the state-of-the-art method by +53.2% in BLEU-avg and +18.6% in CIDEr.
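The abstract names the model's building blocks (contextualized keyword representations, textual feature reinforcement, masked self-attention) without implementation details. Below is a minimal, illustrative PyTorch sketch of how such a multi-modal captioner could be wired together, assuming keyword embeddings from a pre-trained BERT-style encoder, a small CNN image encoder, and a Transformer decoder whose masked self-attention generates the caption. The module names, dimensions, gated "reinforcement" layer, and fusion-by-concatenation scheme are all assumptions for illustration, not the authors' actual architecture.

```python
# Illustrative sketch only: modules, dimensions, and the fusion scheme are
# assumptions, not the architecture described in the paper.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Image encoder: a small CNN standing in for any visual backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        # Contextualized keyword representations are assumed to come from a
        # pre-trained BERT-style encoder (768-dim); project them into the
        # shared model space.
        self.keyword_proj = nn.Linear(768, d_model)
        # "Textual feature reinforcement" approximated here by a learned gate
        # applied residually to the keyword features (an assumption).
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        # Caption generator: Transformer decoder, i.e. masked self-attention
        # over the partial caption plus cross-attention to the fused memory.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image, keyword_embeddings, caption_tokens):
        # Visual tokens: (B, 49, d_model) from a 7x7 feature grid.
        v = self.cnn(image).flatten(2).transpose(1, 2)
        # Keyword tokens: (B, K, d_model), with gated residual "reinforcement".
        k = self.keyword_proj(keyword_embeddings)
        k = k + self.gate(k) * k
        memory = torch.cat([v, k], dim=1)  # joint multi-modal memory
        tgt = self.embed(caption_tokens)
        t = tgt.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(h)  # (B, T, vocab_size) next-token logits


if __name__ == "__main__":
    model = MultiModalCaptioner(vocab_size=5000)
    img = torch.randn(2, 3, 224, 224)      # retinal images
    kw = torch.randn(2, 5, 768)            # e.g. 5 keyword vectors per image
    cap = torch.randint(0, 5000, (2, 12))  # teacher-forced caption tokens
    print(model(img, kw, cap).shape)       # torch.Size([2, 12, 5000])
```

Concatenating visual and keyword tokens into a single decoder memory is only one plausible fusion choice; the paper's actual reinforcement and fusion mechanisms may differ.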
Related papers
- ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue [25.398370966763597]
In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition.
Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones.
We propose ZALM3, a Zero-shot strategy to improve vision-language alignment in Multi-turn Multimodal Medical dialogue.
arXiv Detail & Related papers (2024-09-26T07:55:57Z)
- Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.
Our approach sequences various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
- MedRG: Medical Report Grounding with Multi-modal Large Language Model [42.04042642085121]
Medical Report Grounding (MedRG) is an end-to-end solution that utilizes a multi-modal Large Language Model to predict key phrases.
The experimental results validate the effectiveness of MedRG, surpassing the performance of the existing state-of-the-art medical phrase grounding methods.
arXiv Detail & Related papers (2024-04-10T07:41:35Z)
- Multimodal Foundation Models Exploit Text to Make Medical Image Predictions [3.4230952713864373]
We evaluate the mechanisms by which multimodal foundation models integrate and prioritize different data modalities, including images and text.
Our results suggest that multimodal AI models may be useful in medical diagnostic reasoning but that their accuracy is largely driven, for better and worse, by their exploitation of text.
arXiv Detail & Related papers (2023-11-09T18:48:02Z)
- Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning
for Medical Image Captioning [12.10183458424711]
We present a novel medical image captioning method guided by the Segment Anything Model (SAM).
Our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images.
arXiv Detail & Related papers (2023-11-02T05:44:13Z)
- BiomedJourney: Counterfactual Biomedical Image Generation by
Instruction-Learning from Multimodal Patient Journeys [99.7082441544384]
We present BiomedJourney, a novel method for counterfactual biomedical image generation by instruction-learning.
We use GPT-4 to process the corresponding imaging reports and generate a natural language description of disease progression.
The resulting triples are then used to train a latent diffusion model for counterfactual biomedical image generation.
arXiv Detail & Related papers (2023-10-16T18:59:31Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft
Prompts [63.84720380390935]
There exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language
Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text simplification, based on word simplification and language modelling.
We use a new dataset of pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method, based on a language model trained on medical forum data, generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z)
- Discriminative Cross-Modal Data Augmentation for Medical Imaging
Applications [24.06277026586584]
While deep learning methods have shown great success in medical image analysis, they require a large number of medical images for training.
Due to data privacy concerns and the unavailability of medical annotators, it is often very difficult to obtain many labeled medical images for model training.
We propose a discriminative unpaired image-to-image translation model which translates images in source modality into images in target modality.
arXiv Detail & Related papers (2020-10-07T15:07:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.