Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?
- URL: http://arxiv.org/abs/2404.06503v1
- Date: Tue, 9 Apr 2024 17:54:10 GMT
- Title: Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?
- Authors: Nathan Brake, Thomas Schaaf
- Abstract summary: We analyze two approaches to generate different sections of a SOAP note based on the audio recording of the conversation.
We show that both methods lead to similar ROUGE values and have no difference in terms of the Factuality metric.
- Score: 3.019130210299794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Following an interaction with a patient, physicians are responsible for the submission of clinical documentation, often organized as a SOAP note. A clinical note is not simply a summary of the conversation but requires the use of appropriate medical terminology. The relevant information can then be extracted and organized according to the structure of the SOAP note. In this paper we analyze two different approaches to generate the different sections of a SOAP note based on the audio recording of the conversation, and specifically examine them in terms of note consistency. The first approach generates the sections independently, while the second method generates them all together. In this work we make use of PEGASUS-X Transformer models and observe that both methods lead to similar ROUGE values (less than 1% difference) and have no difference in terms of the Factuality metric. We perform a human evaluation to measure aspects of consistency and demonstrate that LLMs like Llama2 can be used to perform the same tasks with roughly the same agreement as the human annotators. Between the Llama2 analysis and the human reviewers we observe a Cohen Kappa inter-rater reliability of 0.79, 1.00, and 0.32 for consistency of age, gender, and body part injury, respectively. With this we demonstrate the usefulness of leveraging an LLM to measure quality indicators that can be identified by humans but are not currently captured by automatic metrics. This allows scaling evaluation to larger data sets, and we find that clinical note consistency improves by generating each new section conditioned on the output of all previously generated sections.
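The reported agreement numbers are Cohen's kappa scores between the Llama2 judge and the human annotators. As a minimal sketch of how such agreement can be computed with scikit-learn (the label arrays below are hypothetical illustrations, not the paper's data):

```python
# Minimal sketch: Cohen's kappa between an LLM judge and a human annotator.
# The labels below are hypothetical, not the paper's data.
from sklearn.metrics import cohen_kappa_score

# 1 = "note is consistent for this attribute", 0 = "inconsistent",
# one entry per generated note.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
llm_labels   = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```

Kappa discounts the agreement expected by chance, which is what makes the per-attribute scores (0.79 for age, 1.00 for gender, 0.32 for body part injury) comparable across very different label distributions.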
Related papers
- Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge [51.93909886542317]
We show how relying on a single aggregate correlation score can obscure fundamental differences between human behavior and automatic evaluation methods.
We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
arXiv Detail & Related papers (2024-10-03T03:08:29Z)
- Contrastive Learning with Counterfactual Explanations for Radiology Report Generation [83.30609465252441]
We propose a CounterFactual Explanations-based framework (CoFE) for radiology report generation.
Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking "what if" scenarios.
Experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports.
arXiv Detail & Related papers (2024-07-19T17:24:25Z)
- An Investigation of Evaluation Metrics for Automated Medical Note Generation [5.094623170336122]
We study evaluation methods and metrics for the automatic generation of clinical notes from medical conversations.
To study the correlation between the automatic metrics and manual judgments, we evaluate automatic notes/summaries by comparing the system and reference facts.
arXiv Detail & Related papers (2023-05-27T04:34:58Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
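The factors listed in the entry above (how to represent a data point, how to compute a fair distance, how many samples to use) all appear in FID-style evaluation. As generic context rather than the paper's exact protocol, a minimal sketch of the Fréchet distance between Gaussian fits of two feature sets:

```python
# Minimal sketch: Frechet distance between two sets of feature vectors,
# each summarized as a Gaussian (the basis of FID-style GAN evaluation).
# Generic illustration, not the protocol used in the paper above.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(500, 16))  # stand-ins for embedding features
fake_feats = rng.normal(0.3, 1.1, size=(500, 16))
print(frechet_distance(real_feats, fake_feats))
```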
- A Marker-based Neural Network System for Extracting Social Determinants of Health [12.6970199179668]
The impact of social determinants of health (SDoH) on patients' healthcare quality, and the resulting disparities, is well-known.
Many SDoH items are not coded in structured forms in electronic health records.
We explore a multi-stage pipeline involving named entity recognition (NER), relation classification (RC), and text classification methods to extract SDoH information from clinical notes automatically.
arXiv Detail & Related papers (2022-12-24T18:40:23Z)
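A minimal skeleton of the multi-stage pipeline shape described in the entry above (NER, then relation classification, then text classification), using Hugging Face pipelines. The model names are hypothetical placeholders, and the input formulation for the relation stage is an assumption, not the paper's method:

```python
# Minimal sketch of a multi-stage SDoH extraction pipeline:
# NER -> relation classification -> text classification.
# Model names are hypothetical placeholders.
from transformers import pipeline

ner = pipeline("token-classification", model="your-org/sdoh-ner",
               aggregation_strategy="simple")          # stage 1: find SDoH mentions
rel = pipeline("text-classification", model="your-org/sdoh-relation")  # stage 2
doc = pipeline("text-classification", model="your-org/sdoh-status")    # stage 3

note = "Patient lives alone and reports drinking 3-4 beers daily."

entities = ner(note)  # e.g. spans tagged ALCOHOL, LIVING_STATUS
for ent in entities:
    # Stage 2: classify the relation for each mention, here by pairing the
    # span with its sentence (one simple formulation, an assumption).
    relation = rel(f"{ent['word']} [SEP] {note}")
    print(ent["entity_group"], ent["word"], relation[0]["label"])

# Stage 3: note-level status labels (e.g. "current drinker") from the full text.
print(doc(note)[0]["label"])
```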
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
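Since the entry above singles out a character-based Levenshtein metric, here is a minimal sketch of that distance normalized into a similarity score; the exact normalization used in the paper is an assumption:

```python
# Minimal sketch: character-level Levenshtein distance, normalized to a
# similarity score, as one might use to compare a generated note against a
# reference note. Illustrative only.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programming, keeping one row in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(hyp: str, ref: str) -> float:
    denom = max(len(hyp), len(ref)) or 1
    return 1.0 - levenshtein(hyp, ref) / denom

print(levenshtein_similarity("pt reports knee pain", "patient reports knee pain"))
```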
- SparGE: Sparse Coding-based Patient Similarity Learning via Low-rank Constraints and Graph Embedding [6.2383530999725565]
Patient similarity assessment (PSA) is pivotal to evidence-based and personalized medicine.
Machine learning approaches for PSA have to deal with the inherent data deficiencies of EHRs.
SparGE measures similarity by jointly applying sparse coding and graph embedding.
arXiv Detail & Related papers (2022-02-03T06:01:07Z)
- MIMICause: Defining, identifying and predicting types of causal relationships between biomedical concepts from clinical notes [0.0]
We propose annotation guidelines, develop an annotated corpus and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes.
We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language-model-based architectures.
The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines, while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts.
arXiv Detail & Related papers (2021-10-14T00:15:36Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
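A minimal sketch of the resampling idea from the entry above: bootstrap a confidence interval for a system-level metric score from per-document scores. The numbers are hypothetical, and the paper's paired system comparisons are more involved than this:

```python
# Minimal sketch: bootstrap confidence interval for a summarization metric's
# system-level score, resampling documents with replacement. Illustrative only.
import random

def bootstrap_ci(per_doc_scores, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(per_doc_scores)
    means = sorted(
        sum(rng.choices(per_doc_scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

rouge_per_doc = [0.41, 0.38, 0.52, 0.44, 0.36, 0.49, 0.40, 0.47]  # hypothetical
print(bootstrap_ci(rouge_per_doc))  # wide interval: few documents, high uncertainty
```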
- MSED: a multi-modal sleep event detection model for clinical sleep analysis [62.997667081978825]
We designed a single deep neural network architecture to jointly detect sleep events in a polysomnogram.
The performance of the model was quantified by F1, precision, and recall scores, and by correlating index values to clinical values.
arXiv Detail & Related papers (2021-01-07T13:08:44Z)
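As a sketch of how F1, precision, and recall can be scored for detected events, here is an interval-overlap matching scheme; the event times are hypothetical and the paper's actual matching criterion is an assumption:

```python
# Minimal sketch: precision, recall, and F1 for detected events, matching a
# predicted event to a true event when their time intervals overlap.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def prf1(true_events, pred_events):
    tp = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    fp = len(pred_events) - tp
    fn = sum(not any(overlaps(t, p) for p in pred_events) for t in true_events)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_events = [(10.0, 22.5), (40.0, 55.0), (80.0, 91.0)]   # hypothetical, seconds
pred_events = [(11.0, 21.0), (43.0, 52.0), (120.0, 130.0)]
print(prf1(true_events, pred_events))  # -> (0.667, 0.667, 0.667)
```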
- Towards an Automated SOAP Note: Classifying Utterances from Medical Conversations [0.6875312133832078]
We bridge this gap by classifying utterances from medical conversations according to (i) the SOAP section and (ii) the speaker role.
We present a systematic analysis in which we adapt an existing deep learning architecture to the two aforementioned tasks.
The results suggest that modelling context in a hierarchical manner, which captures both word and utterance level context, yields substantial improvements on both classification tasks.
arXiv Detail & Related papers (2020-07-17T04:19:30Z)
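The hierarchical context modelling described in the entry above can be pictured as a two-level encoder: one over the words within an utterance, one over the utterances within a conversation. A minimal PyTorch sketch with illustrative dimensions and GRU encoders (assumptions, not the paper's architecture details):

```python
# Minimal sketch: hierarchical utterance classification. A word-level encoder
# builds one vector per utterance; an utterance-level encoder adds conversation
# context before classifying each utterance (e.g. into SOAP sections).
import torch
import torch.nn as nn

class HierarchicalUtteranceClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb=128, hid=256, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.word_enc = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.utt_enc = nn.GRU(2 * hid, hid, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, token_ids):
        # token_ids: (conversations, utterances, tokens)
        b, u, t = token_ids.shape
        words = self.embed(token_ids.view(b * u, t))     # word-level context
        _, h = self.word_enc(words)                      # h: (2, b*u, hid)
        utt_vecs = h.transpose(0, 1).reshape(b, u, -1)   # one vector per utterance
        ctx, _ = self.utt_enc(utt_vecs)                  # utterance-level context
        return self.classifier(ctx)                      # (b, u, n_classes)

model = HierarchicalUtteranceClassifier()
fake_batch = torch.randint(1, 10_000, (2, 6, 12))  # 2 convs, 6 utts, 12 tokens
print(model(fake_batch).shape)  # torch.Size([2, 6, 4])
```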
This list is automatically generated from the titles and abstracts of the papers on this site.