TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes
- URL: http://arxiv.org/abs/2503.20648v1
- Date: Wed, 26 Mar 2025 15:40:40 GMT
- Title: TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes
- Authors: Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, Chaitanya Shivade,
- Abstract summary: Quality standards for behavioral therapy notes remain underdeveloped. A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. In a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.
- Score: 3.9806397855028983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Behavioral therapy notes are important for both legal compliance and patient care. Unlike for progress notes in physical health, quality standards for behavioral therapy notes remain underdeveloped. To address this gap, we collaborated with licensed therapists to design a comprehensive rubric for evaluating therapy notes across key dimensions: completeness, conciseness, and faithfulness. Further, we extend a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and apply our evaluation framework to measure their quality. We find that: (1) A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. (2) LLMs can mimic human evaluators in assessing completeness and conciseness but struggle with faithfulness. (3) Therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucinations. Surprisingly, in a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.
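As a concrete illustration of the protocol described in the abstract, the sketch below shows one way a rubric-based pass over a single note could be organized. It is a minimal sketch, not the authors' released code: only the three dimensions (completeness, conciseness, faithfulness) come from the abstract, while the example rubric questions, the yes/no answer format, and the `judge` callable (an LLM API call or a human annotator) are assumptions for illustration.

```python
# Minimal sketch of a rubric-based note evaluation pass (hypothetical rubric items;
# the paper's actual rubric was designed with licensed therapists).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RubricItem:
    dimension: str  # "completeness", "conciseness", or "faithfulness"
    question: str   # a yes/no check applied to one note

# Illustrative rubric items only; the real items are not given in the abstract.
RUBRIC: List[RubricItem] = [
    RubricItem("completeness", "Does the note record the client's presenting problem?"),
    RubricItem("completeness", "Does the note record the plan or next steps agreed in session?"),
    RubricItem("conciseness", "Is the note free of repetition and verbatim transcript copying?"),
    RubricItem("faithfulness", "Is every statement in the note supported by the session transcript?"),
]

def evaluate_note(transcript: str, note: str, judge: Callable[[str], str]) -> Dict[str, float]:
    """Score a note on each dimension as the fraction of rubric items answered 'yes'.

    `judge` is any prompt-to-answer function: an LLM API call when approximating the
    automatic evaluators, or a human annotator entering yes/no for the manual protocol.
    """
    answers: Dict[str, List[int]] = {}
    for item in RUBRIC:
        prompt = (
            f"Session transcript:\n{transcript}\n\n"
            f"Therapy note:\n{note}\n\n"
            f"Question ({item.dimension}): {item.question}\n"
            "Answer strictly 'yes' or 'no'."
        )
        verdict = judge(prompt).strip().lower()
        answers.setdefault(item.dimension, []).append(1 if verdict.startswith("yes") else 0)
    return {dim: sum(marks) / len(marks) for dim, marks in answers.items()}
```

A per-item yes/no breakdown is what makes a rubric protocol more interpretable than a single Likert rating: disagreements between evaluators can be traced to specific questions rather than to an overall score.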
Related papers
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, an evaluation framework designed specifically for medical LLMs, to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- Query-Guided Self-Supervised Summarization of Nursing Notes [5.835276312834499]
We introduce QGSumm, a novel query-guided self-supervised domain adaptation approach for abstractive nursing note summarization.
We study our approach and other state-of-the-art Large Language Models (LLMs) for nursing note summarization.
arXiv Detail & Related papers (2024-07-04T18:54:30Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study [17.32433545370711]
Comprehensive summaries of sessions enable an effective continuity in mental health counseling.
Manual summarization presents a significant challenge, diverting experts' attention from the core counseling process.
This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions.
arXiv Detail & Related papers (2024-02-29T11:29:47Z)
- FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of abstracts generated from three randomized controlled trials (RCTs).
We assess the factuality of critical elements of RCTs in those summaries, as well as the reported findings concerning these.
arXiv Detail & Related papers (2024-02-18T04:45:01Z)
- Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes [17.648021186810663]
The purpose of this study was to evaluate the performance of Large Language Models (LLMs) in understanding and processing real-world clinical notes.
The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
arXiv Detail & Related papers (2024-01-24T16:52:37Z)
- Automated Scoring of Clinical Patient Notes using Advanced NLP and Pseudo Labeling [2.711804338865226]
This research introduces an approach leveraging state-of-the-art Natural Language Processing (NLP) techniques.
Our methodology enhances efficiency and effectiveness, significantly reducing training time without compromising performance.
arXiv Detail & Related papers (2024-01-18T05:17:18Z)
- Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
arXiv Detail & Related papers (2022-11-17T10:54:28Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics such as BERTScore (a minimal sketch of this distance appears after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- A preliminary study on evaluating Consultation Notes with Post-Editing [67.30200768442926]
We propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them.
We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing.
We time this and find that it is faster than writing the note from scratch.
arXiv Detail & Related papers (2021-04-09T14:42:00Z)
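The entry above on human evaluation and automatic metrics reports that a plain character-level Levenshtein distance can rival model-based metrics such as BERTScore for scoring generated notes against reference notes. The sketch below is a minimal, dependency-free version of such a metric; the normalization into a 0-1 similarity score is an assumption for illustration, not necessarily the exact formulation used in that paper.

```python
# Character-level Levenshtein distance and a simple normalized similarity
# (the normalization scheme is an illustrative assumption).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 cost if characters match)
            ))
        prev = curr
    return prev[-1]

def note_similarity(reference_note: str, generated_note: str) -> float:
    """Turn the raw edit distance into a similarity score in [0, 1]."""
    distance = levenshtein(reference_note, generated_note)
    longest = max(len(reference_note), len(generated_note)) or 1
    return 1.0 - distance / longest
```

Because the metric operates on raw characters, it needs no model, tokenizer, or GPU, which is part of why its reported competitiveness with model-based metrics is notable.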