STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering
- URL: http://arxiv.org/abs/2406.19973v2
- Date: Thu, 24 Oct 2024 18:47:37 GMT
- Title: STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering
- Authors: Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao
- Abstract summary: STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data.
We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
- Score: 58.79671189792399
- License:
- Abstract: Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data scarcity, we introduce the Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method trains a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance while using only 9% of the medical data.
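Since the method hinges on DPO-guided self-training, a minimal sketch of the standard DPO objective may help make the alignment step concrete. The sketch below assumes PyTorch and summed per-sequence log-probabilities as inputs; the function and tensor names are illustrative and are not taken from the paper's code.

```python
# Minimal sketch of the DPO objective used to align a policy LVLM with
# preference labels from a stronger expert model (e.g., GPT-4o).
# Names, shapes, and the beta value are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss from per-sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding log p(answer | image, question)
    for the preferred ("chosen") or dispreferred ("rejected") answer under the
    trainable policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each answer.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Preferred answers should gain probability mass relative to dispreferred ones.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```

In this setup, the "chosen" answer would be the response preferred by the expert overseer and the "rejected" answer the dispreferred one, with the frozen reference model keeping the policy close to its starting point during fine-tuning.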
Related papers
- A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering [13.237829215746443]
The Sentence-T5 + Mistral 7B model excels in understanding and generating precise medical answers.
Our findings highlight the potential of integrating sophisticated LLMs in medical contexts.
arXiv Detail & Related papers (2024-08-08T00:35:39Z)
- HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale [29.956053068653734]
We create the PubMedVision dataset with 1.3 million medical VQA samples.
Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios.
arXiv Detail & Related papers (2024-06-27T15:50:41Z)
- README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP [9.432205523734707]
We introduce a new task of automatically generating lay definitions, aiming to simplify medical terms into patient-friendly lay language.
We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions.
We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality.
arXiv Detail & Related papers (2023-12-24T23:01:00Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- Robust and Efficient Medical Imaging with Self-Supervision [80.62711706785834]
We present REMEDIS, a unified representation learning strategy to improve robustness and data-efficiency of medical imaging AI.
We study a diverse range of medical imaging tasks and simulate three realistic application scenarios using retrospective data.
arXiv Detail & Related papers (2022-05-19T17:34:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.