Related papers: Closing the gap between open-source and commercial large language models for medical evidence summarization

Closing the gap between open-source and commercial large language models for medical evidence summarization

URL: http://arxiv.org/abs/2408.00588v1
Date: Thu, 25 Jul 2024 05:03:01 GMT
Title: Closing the gap between open-source and commercial large language models for medical evidence summarization
Authors: Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina R. Idnay, Yiming Luo, Elizabeth Park, Jordan G. Nestor, Matthew E. Spotnitz, Ali Soroush, Thomas Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng,
Abstract summary: Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones.
Score: 20.60798771155072
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this study, we investigated to what extent fine-tuning open-source LLMs can further improve their performance in summarizing medical evidence. Utilizing a benchmark dataset, MedReview, consisting of 8,161 pairs of systematic reviews and summaries, we fine-tuned three broadly-used, open-sourced LLMs, namely PRIMERA, LongT5, and Llama-2. Overall, the fine-tuned LLMs obtained an increase of 9.89 in ROUGE-L (95% confidence interval: 8.94-10.81), 13.21 in METEOR score (95% confidence interval: 12.05-14.37), and 15.82 in CHRF score (95% confidence interval: 13.89-16.44). The performance of fine-tuned LongT5 is close to GPT-3.5 with zero-shot settings. Furthermore, smaller fine-tuned models sometimes even demonstrated superior performance compared to larger zero-shot models. The above trends of improvement were also manifested in both human and GPT4-simulated evaluations. Our results can be applied to guide model selection for tasks demanding particular domain knowledge, such as medical evidence summarization.

Related papers

Structuring Radiology Reports: Challenging LLMs with Lightweight Models [5.01440254761063]
Large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment.<n>We explore lightweight encoder-decoder models (300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets.<n>Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set.
arXiv Detail & Related papers (2025-05-30T20:12:51Z)
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't [0.0]
Our study investigates the potential of reinforcement learning to improve reasoning in small language models (LLMs) Training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours resulted in rapid reasoning gains. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches.
arXiv Detail & Related papers (2025-03-20T15:13:23Z)
Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring [18.33969226071914]
We compare nine leading large language models (LLMs) across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML assessed scores.
arXiv Detail & Related papers (2025-03-14T19:34:40Z)
MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning [2.4060718165478376]
Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning. This study focuses on three sections: History of Present Illness, Interval History, and Assessment and Plan.
arXiv Detail & Related papers (2025-01-23T21:32:09Z)
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs [70.4578433679737]
We introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks. Using our benchmark we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z)
CDR: Customizable Density Ratios of Strong-over-weak LLMs for Preference Annotation [15.776175440446414]
Preference tuning of large language models (LLMs) relies on high-quality human preference data. We introduce customized density ratio (CDR), a training-free and highly effective method that leverages off-the-shelf LLMs for preference data annotation. We show that tailoring the density ratio reward function with specific criteria and preference exemplars enhances performance across domains and within target areas.
arXiv Detail & Related papers (2024-11-04T18:54:39Z)
GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak)
arXiv Detail & Related papers (2024-10-11T03:05:06Z)
Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse [27.26121507279163]
We introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. We propose Trust-Align, a method to align LLMs for improved Trust-Score performance. 26 out of 27 models aligned using Trust-Align substantially outperform competitive baselines on ASQA, QAMPARI, and ELI5.
arXiv Detail & Related papers (2024-09-17T14:47:33Z)
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation [20.41379322900742]
We introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning.
arXiv Detail & Related papers (2024-07-15T15:33:45Z)
DistillSeq: A Framework for Safety Alignment Testing in Large Language Models using Knowledge Distillation [4.340880264464675]
Large Language Models (LLMs) have showcased their remarkable capabilities in diverse domains, encompassing natural language understanding, translation, and even code generation. The potential for LLMs to generate harmful content is a significant concern. This study explores cost-saving strategies during the testing phase to balance the need for thorough evaluation with the constraints of resource availability.
arXiv Detail & Related papers (2024-07-14T07:21:54Z)
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm. Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z)
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
Distilling Large Language Models for Matching Patients to Clinical Trials [3.4068841624198942]
The recent success of large language models (LLMs) has paved the way for their adoption in the high-stakes domain of healthcare. This study presents the first systematic examination of the efficacy of both proprietary (GPT-3.5, and GPT-4) and open-source LLMs (LLAMA 7B,13B, and 70B) for the task of patient-trial matching. Our findings reveal that open-source LLMs, when fine-tuned on this limited and synthetic dataset, demonstrate performance parity with their proprietary counterparts.
arXiv Detail & Related papers (2023-12-15T17:11:07Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
Democratizing LLMs: An Exploration of Cost-Performance Trade-offs in Self-Refined Open-Source Models [53.859446823312126]
SoTA open source models of varying sizes from 7B - 65B, on average, improve 8.2% from their baseline performance. Strikingly, even models with extremely small memory footprints, such as Vicuna-7B, show a 11.74% improvement overall and up to a 25.39% improvement in high-creativity, open ended tasks.
arXiv Detail & Related papers (2023-10-11T15:56:00Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
On Learning to Summarize with Large Language Models as References [101.79795027550959]
Large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved.
arXiv Detail & Related papers (2023-05-23T16:56:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.