Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General
Healthcare
- URL: http://arxiv.org/abs/2310.17956v2
- Date: Wed, 1 Nov 2023 07:10:23 GMT
- Title: Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General
Healthcare
- Authors: Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, Yining
Hua
- Abstract summary: This study introduces Qilin-Med-VL, the first Chinese large vision-language model designed to integrate the analysis of textual and visual data.
We also release ChiMed-VL, a dataset consisting of more than 1M image-text pairs.
- Score: 14.646414629627001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have introduced a new era of proficiency in
comprehending complex healthcare and biomedical topics. However, there is a
noticeable lack of models in languages other than English and models that can
interpret multi-modal input, which is crucial for global healthcare
accessibility. In response, this study introduces Qilin-Med-VL, the first
Chinese large vision-language model designed to integrate the analysis of
textual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer
(ViT) with a foundational LLM. It undergoes a thorough two-stage curriculum
training process that includes feature alignment and instruction tuning. This
method enhances the model's ability to generate medical captions and answer
complex medical queries. We also release ChiMed-VL, a dataset consisting of
more than 1M image-text pairs. This dataset has been carefully curated to
enable detailed and comprehensive interpretation of medical data using various
types of images.
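The described recipe resembles LLaVA-style visual instruction tuning: ViT patch features are projected into the LLM's embedding space, stage one trains only the projector on image-text pairs (feature alignment), and stage two unfreezes the LLM for instruction tuning. A minimal PyTorch sketch of that idea follows; the backbones, dimensions, and interface are hypothetical stand-ins, not the released Qilin-Med-VL components.
```python
import torch
import torch.nn as nn

class QilinStyleVLM(nn.Module):
    """Minimal LLaVA-style wrapper: frozen ViT -> linear projector -> causal LLM.
    `vit` and `llm` are hypothetical stand-ins, not the actual Qilin-Med-VL weights."""

    def __init__(self, vit, llm, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vit = vit                                 # pre-trained Vision Transformer
        self.llm = llm                                 # foundational LLM (HF-style interface assumed)
        self.projector = nn.Linear(vit_dim, llm_dim)   # maps patch features into the LLM token space

    def forward(self, pixel_values, text_embeds):
        patch_feats = self.vit(pixel_values)           # (B, N_patches, vit_dim)
        visual_tokens = self.projector(patch_feats)    # (B, N_patches, llm_dim)
        # Prepend projected visual tokens so the LLM attends to image and text jointly.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)

def configure_stage(model, stage):
    """Two-stage curriculum: stage 1 (feature alignment) trains only the projector on
    image-caption pairs; stage 2 (instruction tuning) also updates the LLM."""
    for p in model.vit.parameters():
        p.requires_grad = False                        # vision encoder stays frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```
Under this reading, stage 1 would consume ChiMed-VL-style image-text pairs with a captioning objective, and stage 2 would switch to instruction-formatted medical question answering.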
Related papers
- LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound [7.941670191244354]
We propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning.
Specifically, we devise a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics.
For execution, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from a hospital.
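As a rough illustration only (the summary does not detail LLaVA-Ultra's actual module), fusing a coarse and a fine-grained vision encoder's outputs could look like a residual cross-attention block; the encoder granularity and dimensions below are assumptions.
```python
import torch
import torch.nn as nn

class FineGrainedFusion(nn.Module):
    """Illustrative fusion of coarse and fine-grained vision features via
    cross-attention; the design is an assumption, not the paper's exact module."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, coarse_feats, fine_feats):
        # coarse_feats: (B, N, D) global patch features; fine_feats: (B, M, D) detail features.
        fused, _ = self.cross_attn(query=coarse_feats, key=fine_feats, value=fine_feats)
        return self.norm(coarse_feats + fused)   # residual fusion of subtle visual semantics
```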
arXiv Detail & Related papers (2024-10-19T11:38:31Z)
- LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model [55.80651780294357]
State-of-the-art medical multi-modal large language models (med-MLLM) leverage instruction-following data in pre-training.
LoGra-Med is a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions.
Our results show LoGra-Med matches LLaVA-Med performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data.
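A simplified stand-in for the stated triplet correlations (not LoGra-Med's actual multi-graph algorithm): each of the three views, image, conversation-based description, and extended caption, is pulled toward its paired embeddings and away from in-batch negatives.
```python
import torch
import torch.nn as nn

def tri_modal_triplet_loss(img, conv_desc, ext_cap, margin=0.2):
    """Simplified sketch of triplet correlations across three views.
    Shapes: (B, D); rows are paired image / conversation / extended-caption items.
    Negatives are drawn by shuffling the batch, an assumption for illustration."""
    triplet = nn.TripletMarginLoss(margin=margin)
    neg = lambda x: x[torch.randperm(x.size(0), device=x.device)]  # in-batch negatives
    loss = (triplet(img, conv_desc, neg(conv_desc)) +
            triplet(img, ext_cap, neg(ext_cap)) +
            triplet(conv_desc, ext_cap, neg(ext_cap)))
    return loss / 3.0
```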
arXiv Detail & Related papers (2024-10-03T15:52:03Z)
- ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review [0.0]
Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze medical data.
Our paper reviews recent advancements in developing models designed for medical report generation and visual question answering.
arXiv Detail & Related papers (2024-03-04T20:29:51Z)
- MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries [16.101969130235055]
We introduce the Multimodal Medical Codemixed Question Summarization (MMCQS) dataset.
This dataset combines Hindi-English codemixed medical queries with visual aids.
Our dataset, code, and pre-trained models will be made publicly available.
arXiv Detail & Related papers (2024-01-03T07:58:25Z)
- CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare [16.033112094191395]
We introduce the Multimodal Medical Question Summarization (MMQS) dataset.
This dataset pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs.
We also propose a framework, consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries.
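The four modules could be wired together as below; every helper, threshold, and interface here is a hypothetical placeholder chosen for illustration, not the paper's code or API.
```python
from dataclasses import dataclass

@dataclass
class PipelineOutput:
    disorders: list
    context: str
    concepts: list
    summary: str

def summarize_multimodal_query(query_text, image_embedding, label_bank, llm_generate):
    """Illustrative four-stage pipeline; all components are hypothetical placeholders."""
    # 1. Identify candidate disorders, e.g. via CLIP-style similarity against a label bank.
    disorders = [label for label, emb in label_bank.items()
                 if float(image_embedding @ emb) > 0.3]
    # 2. Generate relevant medical context for those disorders with an LLM.
    context = llm_generate("Briefly describe: " + ", ".join(disorders))
    # 3. Filter medical concepts, keeping only sentences tied to the identified disorders.
    concepts = [s.strip() for s in context.split(".")
                if any(d.lower() in s.lower() for d in disorders)]
    # 4. Craft a visually aware summary conditioned on the query, disorders, and concepts.
    summary = llm_generate(f"Summarize the patient query '{query_text}' given findings "
                           f"{disorders} and context {concepts}.")
    return PipelineOutput(disorders, context, concepts, summary)
```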
arXiv Detail & Related papers (2023-12-16T03:02:05Z)
- Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [7.669872220702526]
We present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets.
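A compact sketch of how a unimodal term (two augmented views of the same image) and a multimodal image-text term might be combined; the weighting, temperature, and exact objective are assumptions rather than the paper's formulation.
```python
import torch
import torch.nn.functional as F

def unimodal_multimodal_contrastive_loss(img_view1, img_view2, text_emb, tau=0.07, alpha=0.5):
    """Combine a unimodal contrastive term (augmented image views) with a
    multimodal image-text term. Shapes: (B, D); weighting is an assumption."""
    def info_nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau                         # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

    unimodal = info_nce(img_view1, img_view2)            # image <-> image
    multimodal = 0.5 * (info_nce(img_view1, text_emb) +  # image <-> text, both directions
                        info_nce(text_emb, img_view1))
    return alpha * unimodal + (1 - alpha) * multimodal
```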
arXiv Detail & Related papers (2023-07-11T15:00:11Z)
- Customizing General-Purpose Foundation Models for Medical Report Generation [64.31265734687182]
The scarcity of labelled medical image-report pairs presents great challenges in the development of deep and large-scale neural networks.
We propose customizing off-the-shelf general-purpose large-scale pre-trained models, i.e., foundation models (FMs) in computer vision and natural language processing.
arXiv Detail & Related papers (2023-06-09T03:02:36Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
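As a rough sketch of the VTLM idea (predicting masked caption tokens from image regions plus the translated caption); the encoder, dimensions, and tokenization are assumptions, not UC2's actual architecture.
```python
import torch
import torch.nn as nn

class VTLMSketch(nn.Module):
    """Illustrative Visual Translation Language Modeling head: recover masked
    English caption tokens from detected region features and the Chinese translation."""

    def __init__(self, vocab_size, dim=768, layers=4, heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.region_proj = nn.Linear(2048, dim)      # assumed detector feature size -> model dim
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.mlm_head = nn.Linear(dim, vocab_size)

    def forward(self, region_feats, masked_en_ids, zh_ids, labels):
        # Concatenate visual regions, the masked English caption, and its Chinese translation.
        x = torch.cat([self.region_proj(region_feats),
                       self.token_emb(masked_en_ids),
                       self.token_emb(zh_ids)], dim=1)
        h = self.encoder(x)
        # Score only the positions of the masked English caption.
        text_start = region_feats.size(1)
        logits = self.mlm_head(h[:, text_start:text_start + masked_en_ids.size(1)])
        return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                           labels.reshape(-1), ignore_index=-100)
```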
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.