Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models
- URL: http://arxiv.org/abs/2411.06713v1
- Date: Mon, 11 Nov 2024 04:45:48 GMT
- Title: Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models
- Authors: Chanseo Lee, Sonu Kumar, Kimon A. Vogt, Sam Meraj,
- Abstract summary: Sporo Health's AI Scribe is a proprietary model fine-tuned for medical scribing.
We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth.
Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance.
- Score: 0.0
- License:
- Abstract: This study compares Sporo Health's AI Scribe, a proprietary model fine-tuned for medical scribing, with various LLMs (GPT-4o, GPT-3.5, Gemma-9B, and Llama-3.2-3B) in clinical documentation. We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth. Each model generated SOAP summaries using zero-shot prompting, with performance assessed via recall, precision, and F1 scores. Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance. Statistically significant differences (p < 0.05) were found between Sporo and the other models, with post-hoc tests showing significant improvements over GPT-3.5, Gemma-9B, and Llama 3.2-3B. While Sporo outperformed GPT-4o by up to 10%, the difference was not statistically significant (p = 0.25). Clinical user satisfaction, measured with a modified PDQI-9 inventory, favored Sporo. Evaluations indicated Sporo's outputs were more accurate and relevant. This highlights the potential of Sporo's multi-agentic architecture to improve clinical workflows.
Related papers
- A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z) - CACER: Clinical Concept Annotations for Cancer Events and Relations [22.866006682711284]
We present Clinical Concept s for Cancer Events and Relations (CACER), a novel corpus with fine-grained annotations for over 48,000 medical problems and drug events.
We develop and evaluate transformer-based information extraction models using fine-tuning and in-context learning.
arXiv Detail & Related papers (2024-09-05T20:42:35Z) - Towards Effective and Efficient Continual Pre-training of Large Language Models [163.34610964970258]
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks.
This paper presents a technical report for continually pre-training Llama-3 (8B)
It significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model.
arXiv Detail & Related papers (2024-07-26T13:55:21Z) - Using LLMs to label medical papers according to the CIViC evidence model [0.0]
We introduce the sequence classification problem CIViC Evidence to the field of medical NLP.
We fine-tune pretrained checkpoints of BERT and RoBERTa on the CIViC Evidence dataset.
We compare the aforementioned BERT-like models to OpenAI's GPT-4 in a few-shot setting.
arXiv Detail & Related papers (2024-07-05T12:30:01Z) - Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z) - Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks [0.7071166713283337]
We created datasets large enough to train machine learning models.
Our goal is to label behaviors corresponding to autism criteria.
Augmenting data increased recall by 13% but decreased precision by 16%.
arXiv Detail & Related papers (2024-05-08T03:18:12Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease
detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare.
Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z) - Large Language Models to Identify Social Determinants of Health in
Electronic Health Records [2.168737004368243]
Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHRs)
This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented.
800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated.
arXiv Detail & Related papers (2023-08-11T19:18:35Z) - Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs)
We tested five different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big Five Inventory (BFI)
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z) - Clinical Deterioration Prediction in Brazilian Hospitals Based on
Artificial Neural Networks and Tree Decision Models [56.93322937189087]
An extremely boosted neural network (XBNet) is used to predict clinical deterioration (CD)
The XGBoost model obtained the best results in predicting CD among Brazilian hospitals' data.
arXiv Detail & Related papers (2022-12-17T23:29:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.