Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
- URL: http://arxiv.org/abs/2405.19567v2
- Date: Thu, 10 Oct 2024 07:47:21 GMT
- Title: Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
- Authors: Shenghuan Sun, Alexander Schubert, Gregory M. Goldgof, Zhiqing Sun, Thomas Hartvigsen, Atul J. Butte, Ahmed Alaa
- Abstract summary: Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
- Score: 53.629132242389716
- Abstract: Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information. This challenge is particularly pronounced in the medical domain, where VLM outputs must not only be accurate in single interactions but also remain consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. To this end, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are used to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.
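To make mechanism (ii) concrete, the sketch below shows how a symbolic diagnostic pathway can be turned into an automatic, annotation-free reward for multi-turn conversations. This is a minimal illustration under toy assumptions: VALID_TRANSITIONS, ALLOWED_FINDINGS, turn_reward, and the keyword-based finding extraction are hypothetical stand-ins, not the paper's actual symbolic representations or reward.

```python
from dataclasses import dataclass, field

# Toy symbolic pathway for a bone marrow workup: a (question, finding)
# pair licenses which follow-up questions may be answered next.
# (Only the leukemia branch is encoded in this toy example.)
VALID_TRANSITIONS = {
    ("image_quality", "adequate"): {"cellularity"},
    ("cellularity", "hypercellular"): {"blast_count"},
    ("blast_count", "elevated"): {"diagnosis"},
}

ALLOWED_FINDINGS = {
    "image_quality": {"adequate", "inadequate"},
    "cellularity": {"hypercellular", "normocellular", "hypocellular"},
    "blast_count": {"elevated", "normal"},
    "diagnosis": {"acute leukemia", "normal marrow"},
}

@dataclass
class ConversationState:
    findings: dict = field(default_factory=dict)  # question -> symbolic finding

def extract_finding(question: str, answer: str) -> str | None:
    """Keyword stand-in for mapping free-text VLM output to a symbolic
    finding (the paper's extraction is richer and GPT-4-guided)."""
    for finding in ALLOWED_FINDINGS.get(question, ()):
        if finding in answer.lower():
            return finding
    return None

def turn_reward(state: ConversationState, question: str, answer: str) -> float:
    """+1 if the answer names a valid finding consistent with the pathway
    so far, -1 otherwise; computed with no human labels at all."""
    finding = extract_finding(question, answer)
    if finding is None:
        return -1.0  # ungrounded / hallucinated output
    for (prev_q, prev_f), next_questions in VALID_TRANSITIONS.items():
        if question in next_questions and state.findings.get(prev_q) != prev_f:
            return -1.0  # contradicts or skips an earlier pathway step
    state.findings[question] = finding
    return 1.0

state = ConversationState()
turn_reward(state, "image_quality", "The slide is adequate for review.")  # 1.0
turn_reward(state, "blast_count", "Blasts are elevated.")  # -1.0: skipped cellularity
```

Because the reward derives entirely from the symbolic pathway, no human preference labels or learned reward model are needed, which is the cost advantage over standard RLHF highlighted in the abstract.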
Related papers
- PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models [10.258261180305439]
Large language models (LLMs) offer a new approach to assessing complex communication metrics.
LLMs offer the potential to advance the field through integration into passive sensing and just-in-time intervention systems.
This study explores LLMs as evaluators of palliative care communication quality, leveraging their linguistic, in-context learning, and reasoning capabilities.
arXiv Detail & Related papers (2024-09-23T16:39:12Z)
- MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis [6.30440420617113]
We introduce MedTsLLM, a general multimodal large language model (LLM) framework that integrates time series data and rich contextual information in the form of text to analyze physiological signals.
We perform three tasks with clinical relevance: semantic segmentation, boundary detection, and anomaly detection in time series.
Our model outperforms state-of-the-art baselines, including deep learning models, other LLMs, and clinical methods across multiple medical domains.
arXiv Detail & Related papers (2024-08-14T18:57:05Z)
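As a rough illustration of how a physiological signal can enter an LLM alongside text, as described in the MedTsLLM entry above, the sketch below patchifies a time series into tokens in the model's embedding space and scores each patch for anomalies. TimeSeriesAdapter, AnomalyHead, and all sizes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TimeSeriesAdapter(nn.Module):
    """Turn a raw signal into patch tokens in the LLM's embedding space."""
    def __init__(self, patch_len: int = 16, d_model: int = 512):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (B, L), with L divisible by patch_len (toy assumption)
        patches = signal.unfold(1, self.patch_len, self.patch_len)  # (B, N, P)
        return self.proj(patches)                                   # (B, N, d)

class AnomalyHead(nn.Module):
    """Per-patch anomaly scores on top of a decoder's hidden states, where
    text tokens (clinical context) and signal tokens share one sequence."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.classify = nn.Linear(d_model, 2)  # normal vs. anomalous patch

    def forward(self, hidden_states: torch.Tensor, n_signal_tokens: int):
        # Score only the trailing positions that correspond to signal patches.
        return self.classify(hidden_states[:, -n_signal_tokens:, :])
```

Boundary detection and semantic segmentation fit the same pattern with different per-patch label spaces.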
- Leveraging Large Language Model as Simulated Patients for Clinical Education [18.67200160979337]
The high cost of training and hiring qualified simulated patients (SPs) limits students' access to this type of clinical training.
With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated.
We present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education.
arXiv Detail & Related papers (2024-04-13T06:36:32Z)
- MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder [26.830574964308962]
We introduce MedFLIP, a Fast Language-Image Pre-training method for Medical analysis.
We explore MAEs for zero-shot learning across domains, which enhances the model's ability to learn from limited data.
Lastly, we validate that using language improves zero-shot performance for medical image analysis.
arXiv Detail & Related papers (2024-03-07T16:11:43Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, played by the LLM under evaluation, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
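For intuition, a conditioning scheme like the one in the entry above, CNN visual features fused with embedded demographic text in a transformer encoder-decoder, might look roughly like this sketch (hypothetical class, layers, and hyperparameters; not the authors' network):

```python
import torch
import torch.nn as nn

class ConditionedReportGenerator(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        # Small CNN stands in for the visual feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)),
        )
        self.visual_proj = nn.Linear(64, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, cxr, demo_tokens, report_tokens):
        feats = self.cnn(cxr)                   # (B, 64, 7, 7)
        vis = feats.flatten(2).transpose(1, 2)  # (B, 49, 64) visual tokens
        vis = self.visual_proj(vis)             # (B, 49, d_model)
        demo = self.text_embed(demo_tokens)     # demographics as text tokens
        memory = torch.cat([vis, demo], dim=1)  # joint conditioning memory
        tgt = self.text_embed(report_tokens)    # (causal masking omitted)
        return self.lm_head(self.decoder(tgt, memory))
```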
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- VBridge: Connecting the Dots Between Features, Explanations, and Data for Healthcare Models [85.4333256782337]
VBridge is a visual analytics tool that seamlessly incorporates machine learning explanations into clinicians' decision-making workflow.
We identified three key challenges: clinicians' unfamiliarity with ML features, a lack of contextual information, and the need for cohort-level evidence.
We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians.
arXiv Detail & Related papers (2021-08-04T17:34:13Z)
- Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective to four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
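To ground the bi-encoder variant named above, here is a minimal sketch of rule-based pair creation plus a bi-encoder retrieval scorer; the section-heading rule, checkpoint choice, and all function names are assumptions for illustration, not CAPR itself:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # assumption: any encoder checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def make_pairs(note_sections: dict[str, str]) -> list[tuple[str, str]]:
    """Rule-based self-supervision: treat each section heading of a
    clinical note as a pseudo-query for the passage beneath it."""
    return [(heading, passage) for heading, passage in note_sections.items()]

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state    # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

def rank_passages(query: str, passages: list[str]) -> list[int]:
    """Bi-encoder retrieval: cosine similarity between independently
    encoded query and passages."""
    sims = embed([query]) @ embed(passages).T      # (1, N)
    return sims.argsort(descending=True).squeeze(0).tolist()
```

The rule manufactures supervision from the notes themselves, which is the sense in which the objective is self-supervised.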