Closing the gap between open-source and commercial large language models for medical evidence summarization
- URL: http://arxiv.org/abs/2408.00588v1
- Date: Thu, 25 Jul 2024 05:03:01 GMT
- Title: Closing the gap between open-source and commercial large language models for medical evidence summarization
- Authors: Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina R. Idnay, Yiming Luo, Elizabeth Park, Jordan G. Nestor, Matthew E. Spotnitz, Ali Soroush, Thomas Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng
- Abstract summary: Large language models (LLMs) hold great promise in summarizing medical evidence.
Most recent studies focus on the application of proprietary LLMs.
While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones.
- Score: 20.60798771155072
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) hold great promise for summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs, which introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short of proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Using MedReview, a benchmark dataset of 8,161 pairs of systematic reviews and summaries, we fine-tuned three widely used open-source LLMs: PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR (95% confidence interval: 12.05-14.37), and 15.82 in chrF (95% confidence interval: 13.89-16.44). Fine-tuned LongT5 performed close to zero-shot GPT-3.5, and smaller fine-tuned models sometimes even outperformed larger zero-shot models. These improvements were also reflected in both human and GPT-4-simulated evaluations. Our results can guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.
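The gains above are reported in standard summarization metrics (ROUGE-L, METEOR, chrF). As a rough illustration of how such scores are typically computed, the sketch below uses the Hugging Face `evaluate` library on placeholder predictions and references; it is not the paper's evaluation pipeline, and the MedReview data is not loaded here.

```python
# Minimal sketch: scoring generated summaries against reference summaries
# with ROUGE-L, METEOR, and chrF via the Hugging Face `evaluate` library.
# The predictions/references below are placeholders, not MedReview data.
import evaluate

predictions = [
    "Fine-tuned open-source models produced more accurate summaries of this review.",
]
references = [
    "Fine-tuning open-source LLMs substantially improved medical evidence summaries.",
]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
chrf = evaluate.load("chrf")

rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
meteor_score = meteor.compute(predictions=predictions, references=references)["meteor"]
chrf_score = chrf.compute(predictions=predictions, references=references)["score"]

print(f"ROUGE-L: {rouge_l:.2f}  METEOR: {meteor_score:.2f}  chrF: {chrf_score:.2f}")
```

The paper also reports 95% confidence intervals for the metric gains; such intervals are commonly obtained by bootstrap resampling over test examples, though the exact procedure used in the study is not detailed in this abstract.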
Related papers
- MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning [2.4060718165478376]
Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning.
This study focuses on three sections: History of Present Illness, Interval History, and Assessment and Plan.
arXiv Detail & Related papers (2025-01-23T21:32:09Z)
- AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs [70.4578433679737]
We introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks.
Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs.
The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z)
- Leveraging Large Language Models for Medical Information Extraction and Query Generation [2.1793134762413433]
This paper introduces a system that integrates large language models (LLMs) into the clinical trial retrieval process.
We evaluate six LLMs for query generation, focusing on open-source and relatively small models that require minimal computational resources.
arXiv Detail & Related papers (2024-10-31T12:01:51Z)
- GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input.
GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak)
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
- Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation [20.41379322900742]
We introduce FLAMe, a family of Foundational Large Autorater Models.
FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks.
We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning.
arXiv Detail & Related papers (2024-07-15T15:33:45Z)
- DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation [4.340880264464675]
Large Language Models (LLMs) have showcased their remarkable capabilities in diverse domains, encompassing natural language understanding, translation, and even code generation.
The potential for LLMs to generate harmful content is a significant concern.
This study explores cost-saving strategies during the testing phase to balance the need for thorough evaluation with the constraints of resource availability.
arXiv Detail & Related papers (2024-07-14T07:21:54Z)
- RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness [102.06442250444618]
We introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm.
RLAIF-V maximally explores open-source MLLMs from two perspectives, including high-quality feedback data generation.
Experiments on six benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models.
arXiv Detail & Related papers (2024-05-27T14:37:01Z)
- Distilling Large Language Models for Matching Patients to Clinical Trials [3.4068841624198942]
The recent success of large language models (LLMs) has paved the way for their adoption in the high-stakes domain of healthcare.
This study presents the first systematic examination of the efficacy of both proprietary (GPT-3.5 and GPT-4) and open-source LLMs (LLAMA 7B, 13B, and 70B) for the task of patient-trial matching.
Our findings reveal that open-source LLMs, when fine-tuned on a limited, synthetic dataset, demonstrate performance parity with their proprietary counterparts.
arXiv Detail & Related papers (2023-12-15T17:11:07Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models [53.859446823312126]
SoTA open-source models of varying sizes, from 7B to 65B, improve on average by 8.2% over their baseline performance.
Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show an 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open-ended tasks.
arXiv Detail & Related papers (2023-10-11T15:56:00Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)