Clinical Camel: An Open Expert-Level Medical Language Model with
  Dialogue-Based Knowledge Encoding
- URL: http://arxiv.org/abs/2305.12031v2
- Date: Thu, 17 Aug 2023 17:19:02 GMT
- Title: Clinical Camel: An Open Expert-Level Medical Language Model with
  Dialogue-Based Knowledge Encoding
- Authors: Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry
  B. Rubin, Bo Wang
- Abstract summary: We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research.
Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs.
- Score: 31.884600238089405
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract:   We present Clinical Camel, an open large language model (LLM) explicitly
tailored for clinical research. Fine-tuned from LLaMA-2 using QLoRA, Clinical
Camel achieves state-of-the-art performance across medical benchmarks among
openly available medical LLMs. Leveraging efficient single-GPU training,
Clinical Camel surpasses GPT-3.5 in five-shot evaluations on all assessed
benchmarks, including 64.3% on the USMLE Sample Exam (compared to 58.5% for
GPT-3.5), 77.9% on PubMedQA (compared to 60.2%), 60.7% on MedQA (compared to
53.6%), and 54.2% on MedMCQA (compared to 51.0%). In addition to these
benchmarks, Clinical Camel demonstrates its broader capabilities, such as
synthesizing plausible clinical notes. This work introduces dialogue-based
knowledge encoding, a novel method to synthesize conversational data from dense
medical texts. While benchmark results are encouraging, extensive and rigorous
human evaluation across diverse clinical scenarios is imperative to ascertain
safety before implementation. By openly sharing Clinical Camel, we hope to
foster transparent and collaborative research, working towards the safe
integration of LLMs within the healthcare domain. Significant challenges
concerning reliability, bias, and the potential for outdated knowledge persist.
Nonetheless, the transparency provided by an open approach reinforces the
scientific rigor essential for future clinical applications.
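
As a quick reading aid for the benchmark figures quoted in the abstract, the following minimal sketch computes the gap between Clinical Camel and GPT-3.5 on each benchmark. The accuracies are copied from the abstract; the dictionary layout and the `margins` helper are illustrative assumptions, not part of the paper's code.

```python
# Five-shot accuracies (%) as quoted in the abstract:
# (Clinical Camel, GPT-3.5) per benchmark.
scores = {
    "USMLE Sample Exam": (64.3, 58.5),
    "PubMedQA": (77.9, 60.2),
    "MedQA": (60.7, 53.6),
    "MedMCQA": (54.2, 51.0),
}


def margins(results):
    """Hypothetical helper: Clinical Camel minus GPT-3.5, in percentage points."""
    return {name: round(camel - gpt, 1) for name, (camel, gpt) in results.items()}


gaps = margins(scores)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
```

By these figures the lead is largest on PubMedQA and smallest on MedMCQA, so the headline claim of surpassing GPT-3.5 rests on margins that vary considerably by benchmark.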
 
      
Related papers
- A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains [15.73821689524201]
 Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation.
We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus.
Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios.
 arXiv  Detail & Related papers  (2025-07-31T12:10:00Z)
- Trustworthy AI for Medicine: Continuous Hallucination Detection and Elimination with CHECK [1.3638020767676653]
 Large language models (LLMs) show promise in healthcare, but hallucinations remain a major barrier to clinical use.
We present CHECK, a continuous-learning framework that integrates structured clinical databases to detect hallucinations.
 arXiv  Detail & Related papers  (2025-06-10T17:12:28Z)
- MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
 Large language models (LLMs) achieve near-perfect scores on medical exams.
These evaluations inadequately reflect the complexity and diversity of real-world clinical practice.
We introduce MedHELM, an evaluation framework for assessing LLM performance for medical tasks.
 arXiv  Detail & Related papers  (2025-05-26T22:55:49Z)
- Clinical knowledge in LLMs does not translate to human interactions [2.523178830945285]
 We tested if large language models (LLMs) can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios.
Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average.
Participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group.
 arXiv  Detail & Related papers  (2025-04-26T13:32:49Z)
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
 CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
 arXiv  Detail & Related papers  (2024-10-04T15:15:36Z)
- Evaluating the Impact of a Specialized LLM on Physician Experience in Clinical Decision Support: A Comparison of Ask Avo and ChatGPT-4 [0.3999851878220878]
 Using large language models (LLMs) to augment clinical decision support systems is a topic of growing interest.
Current shortcomings such as hallucinations and a lack of clear source citations make them unreliable for use in a rapidly growing clinical environment.
This study evaluates Ask Avo-derived software by AvoMD that incorporates a proprietary Model Augmented Language Retrieval system.
 arXiv  Detail & Related papers  (2024-09-06T17:53:29Z)
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
 We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
 arXiv  Detail & Related papers  (2024-08-22T17:01:34Z)
- Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning [19.08691249610632]
 This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model.
We introduce a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model.
Our model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians.
 arXiv  Detail & Related papers  (2024-04-25T15:34:53Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large
  Language Models [59.60384461302662]
 We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
 arXiv  Detail & Related papers  (2024-02-17T08:04:23Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
 We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
 arXiv  Detail & Related papers  (2024-02-15T06:46:48Z)
- LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
 We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with a minimum of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
 arXiv  Detail & Related papers  (2024-01-25T19:57:00Z)
- Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
 Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts.
 arXiv  Detail & Related papers  (2023-09-14T05:15:01Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with
  Electronic Medical Records [60.35217378132709]
 Large language models (LLMs) can follow natural language instructions with human-level fluency.
Evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
 arXiv  Detail & Related papers  (2023-08-27T12:24:39Z)
- Large Language Models Leverage External Knowledge to Extend Clinical
  Insight Beyond Language Boundaries [48.48630043740588]
 Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
 arXiv  Detail & Related papers  (2023-05-17T12:31:26Z)
- Large Language Models for Healthcare Data Augmentation: An Example on
  Patient-Trial Matching [49.78442796596806]
 We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
 arXiv  Detail & Related papers  (2023-03-24T03:14:00Z)
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
 Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
 arXiv  Detail & Related papers  (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     