CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization
in Healthcare
- URL: http://arxiv.org/abs/2312.11541v1
- Date: Sat, 16 Dec 2023 03:02:05 GMT
- Title: CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization
in Healthcare
- Authors: Akash Ghosh, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman
Chadha, Setu Sinha
- Abstract summary: We introduce the Multimodal Medical Question Summarization (MMQS) dataset.
This dataset pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs.
We also propose a framework consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries.
- Score: 16.033112094191395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the era of modern healthcare, swiftly generating medical question
summaries is crucial for informed and timely patient care. Despite the
increasing complexity and volume of medical data, existing studies have focused
solely on text-based summarization, neglecting the integration of visual
information. Recognizing the untapped potential of combining textual queries
with visual representations of medical conditions, we introduce the Multimodal
Medical Question Summarization (MMQS) Dataset. This dataset, a major
contribution to our work, pairs medical queries with visual aids, facilitating
a richer and more nuanced understanding of patient needs. We also propose a
framework that harnesses Contrastive Language-Image Pretraining (CLIP), a
multimodal foundation model, together with general-purpose Large Language
Models (LLMs). The framework comprises four main modules: a medical disorder
identification module, a relevant context generation module, a context
filtration module that distills the relevant medical concepts and knowledge,
and finally a general-purpose LLM that generates visually aware medical
question summaries.
Leveraging our MMQS dataset, we showcase how visual cues from images enhance
the generation of medically nuanced summaries. This multimodal approach not
only enhances the decision-making process in healthcare but also fosters a more
nuanced understanding of patient queries, laying the groundwork for future
research in personalized and responsive medical care.
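To make the four-module design concrete, the following minimal sketch shows how such a CLIP-plus-LLM pipeline could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the prompts and the `call_llm` stub are hypothetical placeholders for whichever general-purpose LLM is used, and the candidate disorder labels are assumed to come from the MMQS setup; only the Hugging Face CLIP calls reflect a real API.

```python
# Illustrative sketch of a four-stage CLIP + LLM question-summarization pipeline.
# The exact prompts, models, and module interfaces of CLIPSyntel are not
# reproduced here; call_llm is a hypothetical stand-in for any LLM client.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; wire this to the general-purpose LLM of your choice."""
    raise NotImplementedError


def identify_disorder(image: Image.Image, candidate_disorders: list[str]) -> str:
    """Module 1: zero-shot match the image against candidate disorder labels with CLIP."""
    inputs = clip_processor(
        text=[f"a photo showing {d}" for d in candidate_disorders],
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = clip_model(**inputs).logits_per_image  # shape: (1, num_labels)
    return candidate_disorders[int(logits.argmax(dim=-1))]


def generate_context(disorder: str) -> str:
    """Module 2: ask the LLM for background knowledge on the identified disorder."""
    return call_llm(f"Briefly describe the medical disorder '{disorder}': "
                    "typical symptoms, causes, and common patient concerns.")


def filter_context(query: str, context: str) -> str:
    """Module 3: keep only the medical concepts relevant to the patient's query."""
    return call_llm(f"Patient query: {query}\nBackground: {context}\n"
                    "List only the medical concepts from the background that are "
                    "directly relevant to this query.")


def summarize(query: str, disorder: str, filtered_context: str) -> str:
    """Module 4: produce a visually aware medical question summary."""
    return call_llm(f"Patient query: {query}\nVisually identified disorder: {disorder}\n"
                    f"Relevant concepts: {filtered_context}\n"
                    "Write a concise medical question summary that reflects both "
                    "the text and the visual evidence.")


def question_summary_pipeline(image_path: str, query: str,
                              candidate_disorders: list[str]) -> str:
    image = Image.open(image_path)
    disorder = identify_disorder(image, candidate_disorders)
    context = generate_context(disorder)
    filtered = filter_context(query, context)
    return summarize(query, disorder, filtered)
```

In a pipeline of this shape, CLIP contributes only the visual grounding step, so the downstream prompts and the choice of LLM can be swapped or extended without retraining any vision component.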
Related papers
- MC-CoT: A Modular Collaborative CoT Framework for Zero-shot Medical-VQA with LLM and MLLM Integration [36.972533173970554]
Multimodal large language models (MLLMs) have been fine-tuned on specific medical image datasets to address medical visual question answering (Med-VQA) tasks.
We introduce MC-CoT, a modular cross-modal collaboration Chain-of-Thought framework designed to enhance the zero-shot performance of MLLMs in Med-VQA.
Our experiments on datasets such as SLAKE, VQA-RAD, and PATH-VQA show that MC-CoT surpasses standalone MLLMs and various multimodality CoT frameworks in recall rate and accuracy.
arXiv Detail & Related papers (2024-10-06T15:28:48Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding [48.348511646407026]
We introduce the Medical dialogue with Knowledge enhancement and clinical Pathway encoding framework.
The framework integrates an external knowledge enhancement module through a medical knowledge graph and an internal clinical pathway encoding via medical entities and physician actions.
arXiv Detail & Related papers (2024-03-11T10:57:45Z)
- REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models [19.62552013839689]
Existing models often lack the medical context relevant to clinical tasks, prompting the incorporation of external knowledge.
We propose REALM, a Retrieval-Augmented Generation (RAG) driven framework to enhance multimodal EHR representations.
Our experiments on MIMIC-III mortality and readmission tasks showcase the superior performance of our REALM framework over baselines.
arXiv Detail & Related papers (2024-02-10T18:27:28Z)
- MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries [16.101969130235055]
We introduce the Multimodal Medical Codemixed Question Summarization (MMCQS) dataset.
This dataset combines Hindi-English codemixed medical queries with visual aids.
Our dataset, code, and pre-trained models will be made publicly available.
arXiv Detail & Related papers (2024-01-03T07:58:25Z)
- Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review [16.008511195589925]
Large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning.
This paper provides a comprehensive review on the applications and implications of LLMs in medicine.
arXiv Detail & Related papers (2023-11-03T13:51:36Z)
- Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare [14.646414629627001]
This study introduces Qilin-Med-VL, the first Chinese large vision-language model designed to integrate the analysis of textual and visual data.
We also release ChiMed-VL, a dataset consisting of more than 1M image-text pairs.
arXiv Detail & Related papers (2023-10-27T08:05:21Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining [121.89793208683625]
Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks.
We propose a new paradigm called Medical-knowledge-enhanced mulTimOdal pretRaining (MOTOR).
arXiv Detail & Related papers (2023-04-26T01:26:19Z)
- Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge [68.90835997085557]
We propose a systematic and effective approach to enhance medical vision-and-language pre-training with structured medical knowledge from three perspectives.
First, we align the representations of the vision encoder and the language encoder through knowledge.
Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text.
Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks.
arXiv Detail & Related papers (2022-09-15T08:00:01Z)