TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes
- URL: http://arxiv.org/abs/2503.20648v1
- Date: Wed, 26 Mar 2025 15:40:40 GMT
- Title: TN-Eval: Rubric and Evaluation Protocols for Measuring the Quality of Behavioral Therapy Notes
- Authors: Raj Sanjay Shah, Lei Xu, Qianchu Liu, Jon Burnsky, Drew Bertagnolli, Chaitanya Shivade
- Abstract summary: Quality standards for behavioral therapy notes remain underdeveloped. A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. In a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Behavioral therapy notes are important for both legal compliance and patient care. Unlike progress notes in physical health, quality standards for behavioral therapy notes remain underdeveloped. To address this gap, we collaborated with licensed therapists to design a comprehensive rubric for evaluating therapy notes across key dimensions: completeness, conciseness, and faithfulness. Further, we extend a public dataset of behavioral health conversations with therapist-written notes and LLM-generated notes, and apply our evaluation framework to measure their quality. We find that: (1) A rubric-based manual evaluation protocol offers more reliable and interpretable results than traditional Likert-scale annotations. (2) LLMs can mimic human evaluators in assessing completeness and conciseness but struggle with faithfulness. (3) Therapist-written notes often lack completeness and conciseness, while LLM-generated notes contain hallucinations. Surprisingly, in a blind test, therapists prefer and judge LLM-generated notes to be superior to therapist-written notes.
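To make the contrast with Likert-scale annotation concrete, here is a minimal Python sketch of a rubric-based protocol: annotators answer binary per-item checks, which are then aggregated per dimension. The rubric items below are invented placeholders, not the paper's actual rubric.

```python
# Minimal sketch of a rubric-based evaluation protocol. The item texts are
# invented placeholders; the paper's rubric was designed with licensed
# therapists and is not reproduced here.
from dataclasses import dataclass

@dataclass
class RubricItem:
    dimension: str  # "completeness", "conciseness", or "faithfulness"
    question: str   # a yes/no check an annotator answers per note

RUBRIC = [
    RubricItem("completeness", "Does the note state the presenting problem?"),
    RubricItem("completeness", "Does the note record agreed next steps?"),
    RubricItem("conciseness", "Is the note free of redundant detail?"),
    RubricItem("faithfulness", "Is every claim supported by the session?"),
]

def score_note(answers: list[bool]) -> dict[str, float]:
    """Aggregate binary per-item judgments into per-dimension scores."""
    by_dim: dict[str, list[bool]] = {}
    for item, answer in zip(RUBRIC, answers, strict=True):
        by_dim.setdefault(item.dimension, []).append(answer)
    return {dim: sum(a) / len(a) for dim, a in by_dim.items()}

# One annotator's yes/no answers, in rubric order.
print(score_note([True, False, True, True]))
# {'completeness': 0.5, 'conciseness': 1.0, 'faithfulness': 1.0}
```

Binary items like these are what make such a protocol more interpretable than a single 1-5 Likert rating: a disagreement between annotators can be traced to a specific check rather than an overall impression.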
Related papers
- Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models [92.93521294357058]
Narrative therapy helps individuals transform problematic life stories into empowering alternatives. Current approaches lack realism in specialized psychotherapy and fail to capture therapeutic progression over time. INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses.
arXiv Detail & Related papers (2025-07-27T11:52:09Z) - From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes [26.750112195124284]
We propose a pipeline that distills real user feedback into structured checklists for note evaluation. Using deidentified data from over 21,000 clinical encounters, we show that our feedback-derived checklist outperforms baseline approaches. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
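As a loose sketch of the idea (the checklist items, the keyword checks, and the threshold below are all assumptions, not the paper's pipeline), checklist-based note evaluation might look like this:

```python
# Illustrative checklist-based note evaluation; the items, the naive keyword
# checks, and the threshold stand in for a feedback-derived checklist and
# its actual judging procedure.
from typing import Callable

CHECKLIST: list[tuple[str, Callable[[str], bool]]] = [
    ("states chief complaint", lambda note: "chief complaint" in note.lower()),
    ("documents a plan", lambda note: "plan" in note.lower()),
    ("notes follow-up timing", lambda note: "follow-up" in note.lower()),
]

QUALITY_THRESHOLD = 0.8  # assumed cutoff for flagging a note

def evaluate(note: str) -> tuple[float, list[str]]:
    """Return the fraction of checks passed and the names of failed checks."""
    failed = [name for name, check in CHECKLIST if not check(note)]
    return 1 - len(failed) / len(CHECKLIST), failed

score, failed = evaluate("Chief complaint: insomnia. Plan: CBT-I referral.")
if score < QUALITY_THRESHOLD:
    print(f"likely low quality ({score:.2f}); failed: {failed}")
```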
arXiv Detail & Related papers (2025-07-23T17:28:31Z) - AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation [55.2739790399209]
We present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation.
arXiv Detail & Related papers (2025-05-17T07:44:54Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
To address these challenges, we propose Med-CoDE, an evaluation framework specifically designed for medical LLMs.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Query-Guided Self-Supervised Summarization of Nursing Notes [5.835276312834499]
We introduce QGSumm, a novel query-guided self-supervised domain adaptation approach for abstractive nursing note summarization. We study our approach and other state-of-the-art Large Language Models (LLMs) for nursing note summarization.
arXiv Detail & Related papers (2024-07-04T18:54:30Z) - Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation [19.08691249610632]
This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model. Our process incorporates continued pretraining, supervised fine-tuning, and reinforcement learning from both AI and human feedback. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians.
arXiv Detail & Related papers (2024-04-25T15:34:53Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
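A rough sketch of the attribute-structured idea: score each attribute separately, then combine. The attribute list and the toy token-overlap judge below are illustrative stand-ins for the human or LLM judgments the paper relies on.

```python
# Attribute-structured scoring sketch: instead of one holistic grade, each
# attribute is judged on its own and the scores are combined. The attributes
# and the toy overlap judge are assumptions for illustration.
ATTRIBUTES = ["diagnosis", "medications", "follow-up"]

def toy_judge(summary: str, reference: str, attribute: str) -> float:
    """Token overlap between sentences of each text that mention the attribute."""
    def relevant_tokens(text: str) -> set[str]:
        sentences = [s for s in text.split(".") if attribute in s.lower()]
        return {tok for s in sentences for tok in s.lower().split()}
    summ, ref = relevant_tokens(summary), relevant_tokens(reference)
    if not ref:
        return 1.0  # attribute absent from the reference: nothing to miss
    return len(summ & ref) / len(ref)

def attribute_structured_score(summary: str, reference: str):
    per_attr = {a: toy_judge(summary, reference, a) for a in ATTRIBUTES}
    return sum(per_attr.values()) / len(per_attr), per_attr

overall, detail = attribute_structured_score(
    "Diagnosis: major depression. Follow-up in two weeks.",
    "Diagnosis: major depressive disorder. Medications: sertraline. "
    "Follow-up in two weeks.",
)
print(overall, detail)  # per-attribute scores localize what the summary missed
```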
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study [17.32433545370711]
Comprehensive summaries of sessions enable an effective continuity in mental health counseling.
Manual summarization presents a significant challenge, diverting experts' attention from the core counseling process.
This study evaluates the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing various components of therapy sessions.
arXiv Detail & Related papers (2024-02-29T11:29:47Z) - FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence [46.71469172542448]
This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts.
It consists of 345 plain language summaries of RCT abstracts generated from three LLMs.
We assess the factuality of critical elements of RCTs (Populations, Interventions, Comparators, Outcomes) in those summaries, as well as the reported findings concerning these.
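For illustration, element-level ratings of this kind can be represented with a simple schema; the rating labels below are assumptions, not FactPICO's actual rating scale.

```python
# Sketch of per-element factuality ratings in the spirit of FactPICO:
# each PICO element of a summary receives its own judgment.
from dataclasses import dataclass
from enum import Enum

class Rating(Enum):
    SUPPORTED = "supported"      # element is faithful to the RCT abstract
    UNSUPPORTED = "unsupported"  # element contradicts or is not grounded
    MISSING = "missing"          # element is omitted from the summary

@dataclass
class PICORating:
    population: Rating
    intervention: Rating
    comparator: Rating
    outcome: Rating

example = PICORating(Rating.SUPPORTED, Rating.SUPPORTED,
                     Rating.MISSING, Rating.UNSUPPORTED)
print(example)
```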
arXiv Detail & Related papers (2024-02-18T04:45:01Z) - Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes [17.648021186810663]
The purpose of this study was to evaluate the performance of Large Language Models (LLMs) in understanding and processing real-world clinical notes.
The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
arXiv Detail & Related papers (2024-01-24T16:52:37Z) - Automated Scoring of Clinical Patient Notes using Advanced NLP and Pseudo Labeling [2.711804338865226]
This research introduces an automated note-scoring approach leveraging state-of-the-art Natural Language Processing (NLP) techniques and pseudo labeling.
Our methodology enhances efficiency and effectiveness, significantly reducing training time without compromising performance.
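Pseudo labeling itself is a generic semi-supervised recipe; a minimal self-training sketch follows, where the TF-IDF + logistic regression pipeline, the confidence cutoff, and the round count are assumptions, not the paper's configuration.

```python
# Generic self-training / pseudo-labeling sketch: a model fit on labeled
# notes labels the unlabeled pool, and high-confidence predictions are
# folded back into the training set. Pipeline and cutoff are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_train(labeled, unlabeled, confidence=0.9, rounds=3):
    """labeled: list of (text, label); unlabeled: list of text."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    for _ in range(rounds):
        texts, labels = zip(*labeled)
        model.fit(texts, labels)
        remaining = []
        for text in unlabeled:
            probs = model.predict_proba([text])[0]
            if probs.max() >= confidence:
                # Confident prediction becomes a pseudo-label.
                labeled.append((text, model.classes_[probs.argmax()]))
            else:
                remaining.append(text)
        unlabeled = remaining
    return model

# Usage (needs at least one example per class):
# model = self_train([("thorough note ...", 1), ("sparse note ...", 0)], pool)
```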
arXiv Detail & Related papers (2024-01-18T05:17:18Z) - Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation [58.54483567073125]
We propose a protocol that aims to increase objectivity by grounding evaluations in Consultation Checklists.
We observed good levels of inter-annotator agreement in a first evaluation study using the protocol.
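The agreement statistic is not named in this summary; Cohen's kappa is one common choice for quantifying inter-annotator agreement, and computing it is a one-liner with scikit-learn.

```python
# Cohen's kappa on toy per-checklist-item judgments from two annotators
# (illustrative; the study does not necessarily use this statistic).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 1]
annotator_b = [1, 1, 0, 0, 0, 1]
print(cohen_kappa_score(annotator_a, annotator_b))
```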
arXiv Detail & Related papers (2022-11-17T10:54:28Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
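For reference, the character-based Levenshtein distance in question is straightforward to implement; the length normalisation below is one common choice, not necessarily the study's.

```python
# Character-level Levenshtein edit distance via the standard two-row DP.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Normalised to a similarity in [0, 1] so notes of different lengths are
# comparable (normalisation choice is an assumption).
def similarity(a: str, b: str) -> float:
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(similarity("patient reports low mood", "patient reported low mood"))
```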
arXiv Detail & Related papers (2022-04-01T14:04:16Z) - A preliminary study on evaluating Consultation Notes with Post-Editing [67.30200768442926]
We propose a semi-automatic approach whereby physicians post-edit generated notes before submitting them.
We conduct a preliminary study on the time saving of automatically generated consultation notes with post-editing.
We time this and find that it is faster than writing the note from scratch.
arXiv Detail & Related papers (2021-04-09T14:42:00Z)