Related papers: A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness

URL: http://arxiv.org/abs/2512.08936v1
Date: Thu, 23 Oct 2025 06:54:33 GMT
Title: A Principle-based Framework for the Development and Evaluation of Large Language Models for Health and Wellness
Authors: Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, Alex Lin, Erik Schenck, Shiva Rajagopal, Jia-Ru Chung, Anusha Venkatakrishnan, Amy Armento Lee, Maryam Karimzadehgan, Qingyou Meng, Rythm Agarwal, Aravind Natarajan, Tracy Giest,
Abstract summary: This paper describes the development of the Fitbit Insights explorer, a large language model (LLM)-powered system designed to help users interpret their personal health data. It introduces the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework. It integrates comprehensive evaluation techniques including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing.
Score: 7.135227672247848
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The incorporation of generative artificial intelligence into personal health applications presents a transformative opportunity for personalized, data-driven health and fitness guidance, yet also poses challenges related to user safety, model accuracy, and personal privacy. To address these challenges, a novel, principle-based framework was developed and validated for the systematic evaluation of LLMs applied to personal health and wellness. First, the development of the Fitbit Insights explorer, a large language model (LLM)-powered system designed to help users interpret their personal health data, is described. Subsequently, the safety, helpfulness, accuracy, relevance, and personalization (SHARP) principle-based framework is introduced as an end-to-end operational methodology that integrates comprehensive evaluation techniques including human evaluation by generalists and clinical specialists, autorater assessments, and adversarial testing, into an iterative development lifecycle. Through the application of this framework to the Fitbit Insights explorer in a staged deployment involving over 13,000 consented users, challenges not apparent during initial testing were systematically identified. This process guided targeted improvements to the system and demonstrated the necessity of combining isolated technical evaluations with real-world user feedback. Finally, a comprehensive, actionable approach is established for the responsible development and deployment of LLM-powered health applications, providing a standardized methodology to foster innovation while ensuring emerging technologies are safe, effective, and trustworthy for users.

Related papers

Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z)
ChroniUXMag: A Persona-Driven Framework for Inclusive mHealth Requirements Engineering [6.574640199180087]
This study introduces ChroniUXMag, a framework for eliciting and analysing inclusivity requirements in mHealth design. Building on InclusiveMag and GenderMag principles, the framework aims to help researchers and practitioners systematically capture and evaluate factors that influence how individuals with chronic conditions perceive, trust, and interact with mHealth systems.
arXiv Detail & Related papers (2025-11-23T22:20:13Z)
From Framework to Practice: Designing a Real-World Telehealth Application for Palliative Care [9.062051939081783]
This paper presents an analysis of designing a software application focused on Enhanced Telehealth Capabilities (ETHC) for palliative care. Our socio-technical design framework was successful in producing a secure, equitable, and resilient digital health application.
arXiv Detail & Related papers (2025-11-01T12:14:25Z)
Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support [0.0]
Mentalic Net Conversational AI has a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies.
arXiv Detail & Related papers (2025-08-27T03:44:56Z)
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z)
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective [377.2483044466149]
Generative Foundation Models (GenFMs) have emerged as transformative tools. Their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions.
arXiv Detail & Related papers (2025-02-20T06:20:36Z)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
Guiding IoT-Based Healthcare Alert Systems with Large Language Models [22.54714587190204]
Healthcare alert systems (HAS) are undergoing rapid evolution, propelled by advancements in artificial intelligence (AI), Internet of Things (IoT) technologies, and increasing health consciousness. Despite significant progress, a fundamental challenge remains: balancing the accuracy of personalized health alerts with stringent privacy protection in HAS environments constrained by resources. We introduce a uniform framework, LLM-HAS, which incorporates Large Language Models (LLM) into HAS to significantly boost the accuracy, ensure user privacy, and enhance personalized health service.
arXiv Detail & Related papers (2024-08-23T13:55:36Z)
A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions [23.36640449085249]
We trace the recent advances of Medical Large Language Models (Med-LLMs). The wide-ranging applications of Med-LLMs are investigated across various healthcare domains. We discuss the challenges associated with ensuring fairness, accountability, privacy, and robustness.
arXiv Detail & Related papers (2024-06-06T03:15:13Z)
Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification [81.32981236437395]
We present a semi-formal verification approach for decision-making tasks based on interval analysis. Our method obtains results comparable to those of formal verifiers on standard benchmarks. Our approach makes it possible to efficiently evaluate safety properties for decision-making models in practical applications.
arXiv Detail & Related papers (2020-10-19T11:18:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
