A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause
- URL: http://arxiv.org/abs/2502.03579v1
- Date: Wed, 05 Feb 2025 19:56:52 GMT
- Title: A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause
- Authors: Roshini Deva, Manvi S, Jasmine Zhou, Elizabeth Britton Chahine, Agena Davenport-Nicholson, Nadi Nina Kaonga, Selen Bozkurt, Azra Ismail
- Abstract summary: The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention.
We examine the performance of publicly available LLM-based chatbots for menopause-related queries.
Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics.
- Score: 7.156867036177255
- Abstract: The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention, particularly for question-answering tasks. Given the high-stakes nature of healthcare, it is essential to ensure that LLM-generated content is accurate and reliable to prevent adverse outcomes. However, the development of robust evaluation metrics and methodologies remains a matter of much debate. We examine the performance of publicly available LLM-based chatbots for menopause-related queries, using a mixed-methods approach to evaluate safety, consensus, objectivity, reproducibility, and explainability. Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics. We propose the need for customized and ethically grounded evaluation frameworks to assess LLMs to advance safe and effective use in healthcare.
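Among the five dimensions the paper evaluates, reproducibility lends itself to a simple programmatic check. A minimal sketch in Python, assuming a generic `ask` chatbot client and plain string similarity (the authors' actual mixed-methods protocol is richer):

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def reproducibility_score(ask, question, n_trials=5):
    """Ask the same question n_trials times and measure how similar
    the answers are to one another (1.0 = identical every time)."""
    answers = [ask(question) for _ in range(n_trials)]
    return mean(SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(answers, 2))

# `ask` stands in for any chatbot client; here a canned responder
# simulates slightly varying answers across repeated queries.
canned = iter(["HRT can relieve hot flashes.",
               "Hormone therapy can relieve hot flashes.",
               "HRT can relieve hot flashes.",
               "HRT may relieve hot flashes.",
               "HRT can relieve hot flashes."])
score = reproducibility_score(lambda q: next(canned),
                              "Does HRT help with hot flashes?")
print(f"reproducibility = {score:.2f}")
```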
Related papers
- Value Compass Leaderboard: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with human values has become imperative.
Existing evaluations focus narrowly on safety risks such as bias and toxicity.
Existing benchmarks are prone to data contamination.
The pluralistic nature of human values across individuals and cultures is largely ignored when measuring LLMs' value alignment.
arXiv Detail & Related papers (2025-01-13T05:53:56Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
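The abstract does not reproduce the cue design, but a zero-shot scoring prompt of this kind might be assembled roughly as follows (the MADRS item, anchor descriptions, and output format here are illustrative assumptions, not the paper's actual cues):

```python
# Sketch of a zero-shot scoring prompt in the spirit of LlaMADRS.
MADRS_ITEM = "Reported sadness"
ANCHORS = {0: "occasional sadness in keeping with circumstances",
           2: "sad or low but brightens up without difficulty",
           4: "pervasive feelings of sadness or gloominess",
           6: "continuous or unvarying sadness, misery or despondency"}

def build_prompt(transcript: str) -> str:
    """Embed the rating item, its anchor points, and the transcript
    into a single zero-shot instruction for the scoring model."""
    anchor_text = "\n".join(f"  {k}: {v}" for k, v in ANCHORS.items())
    return (
        "You are rating a clinical interview on the MADRS scale.\n"
        f"Item: {MADRS_ITEM}\n"
        f"Anchor points:\n{anchor_text}\n"
        "Read the transcript and answer with a single integer 0-6.\n\n"
        f"Transcript:\n{transcript}\n\nScore:"
    )

print(build_prompt("Patient: I feel down most days, "
                   "but I can still laugh with my kids."))
```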
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
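A sketch of the underlying idea of stress-testing retrieval with supplementary noise; the document mix and ratio below are illustrative, not MedRGB's actual construction:

```python
import random

def perturb_retrieval(relevant_docs, distractor_docs,
                      noise_ratio=0.5, seed=0):
    """Mix distractor passages into the retrieved context to probe
    whether a RAG system can ignore noise and misinformation."""
    rng = random.Random(seed)
    n_noise = min(len(distractor_docs),
                  int(len(relevant_docs) * noise_ratio))
    context = relevant_docs + rng.sample(distractor_docs, n_noise)
    rng.shuffle(context)
    return context

docs = ["Metformin is first-line therapy for type 2 diabetes."]
noise = ["Vitamin C cures diabetes.",              # misinformation probe
         "The hospital cafeteria opens at 7 am."]  # irrelevant noise
print(perturb_retrieval(docs, noise, noise_ratio=1.0))
```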
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools [13.34861013664551]
We created an evaluation framework with 100 benchmark questions and ideal responses.
This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot.
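A minimal sketch of comparing a chatbot answer against an expert-written ideal response; the token-overlap F1 used here is a crude stand-in for the framework's validated safety metrics:

```python
def token_f1(candidate: str, reference: str) -> float:
    """Crude lexical overlap between a chatbot answer and the
    expert-written ideal response for the same benchmark question."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

ideal = ("If you are having thoughts of self-harm, "
         "contact a crisis line immediately.")
answer = ("Please contact a crisis line immediately "
          "if you have thoughts of self-harm.")
print(f"overlap F1 = {token_f1(answer, ideal):.2f}")
```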
arXiv Detail & Related papers (2024-08-03T19:57:49Z)
- How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs).
Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance.
We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
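The abstract does not give the aggregation formula; one plausible reading is a weighted combination of the two metrics, sketched below (the weighting scheme is an assumption, not the paper's actual definition):

```python
def reliability(factuality: float, consistency: float,
                w: float = 0.5) -> float:
    """Combine per-model factuality and consistency (both in [0, 1])
    into one reliability score via a convex weighting."""
    assert 0.0 <= w <= 1.0
    return w * factuality + (1 - w) * consistency

print(reliability(factuality=0.82, consistency=0.67))  # 0.745
```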
arXiv Detail & Related papers (2024-07-18T15:20:18Z)
- A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models: Safety, Consensus, Objectivity, Reproducibility and Explainability [5.924966178563408]
We propose five key aspects for the evaluation of large language models (LLMs).
We suggest that S.C.O.R.E. may form the basis for an evaluation framework for future LLM-based models.
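The five aspects map naturally onto a per-response rubric. A minimal sketch, assuming a 1-5 expert rating scale that the abstract does not specify:

```python
from dataclasses import dataclass, asdict
from statistics import mean

@dataclass
class ScoreRubric:
    """One expert's 1-5 ratings of a single LLM response on the five
    S.C.O.R.E. aspects (the 1-5 scale is an illustrative assumption)."""
    safety: int
    consensus: int
    objectivity: int
    reproducibility: int
    explainability: int

    def overall(self) -> float:
        """Average across all five aspects."""
        return mean(asdict(self).values())

rating = ScoreRubric(safety=5, consensus=4, objectivity=4,
                     reproducibility=3, explainability=4)
print(f"overall = {rating.overall():.1f}")
```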
arXiv Detail & Related papers (2024-07-10T13:45:16Z)
- A Framework for Human Evaluation of Large Language Models in Healthcare Derived from Literature Review [11.28580626017631]
We highlight a notable need for a standardized and consistent human evaluation approach.
We propose a comprehensive and practical framework for the human evaluation of large language models (LLMs).
This framework aims to improve the reliability, generalizability, and applicability of human evaluation of LLMs in different healthcare applications.
arXiv Detail & Related papers (2024-05-04T04:16:07Z)
- A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [20.11590976578911]
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities.
Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity.
We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions.
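One common probe for such failures is counterfactual demographic substitution: ask the same medical question with only a demographic detail swapped and compare the answers. The sketch below is a generic illustration of that technique, not necessarily the paper's specific toolbox:

```python
# Generate prompt pairs that differ only in a demographic attribute;
# feed each to the model and diff the advice it returns.
TEMPLATE = "A {group} patient reports chest pain. What should they do?"
GROUPS = ["55-year-old white male", "55-year-old Black female"]

def counterfactual_prompts(template: str, groups: list[str]) -> list[str]:
    """Instantiate the same clinical question for each group."""
    return [template.format(group=g) for g in groups]

for prompt in counterfactual_prompts(TEMPLATE, GROUPS):
    print(prompt)  # compare model answers across prompts for disparities
```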
arXiv Detail & Related papers (2024-03-18T17:56:37Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
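Attribute Structuring in miniature: grade a summary attribute by attribute and aggregate, rather than assigning one holistic score. The attribute list and the keyword-matching "judge" below are stand-ins for the paper's LLM-based per-attribute scoring:

```python
ATTRIBUTES = ["diagnosis", "medications", "follow-up"]

def score_attribute(summary: str, attribute: str, reference: dict) -> int:
    """1 if the reference fact for this attribute appears in the
    summary, else 0 (a toy judge; the paper uses an LLM here)."""
    return int(reference[attribute].lower() in summary.lower())

reference = {"diagnosis": "community-acquired pneumonia",
             "medications": "amoxicillin",
             "follow-up": "chest x-ray in 6 weeks"}
summary = ("Patient treated for community-acquired pneumonia with "
           "amoxicillin; repeat chest x-ray in 6 weeks.")

per_attr = {a: score_attribute(summary, a, reference) for a in ATTRIBUTES}
print(per_attr, "->", sum(per_attr.values()) / len(per_attr))
```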
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It is the first to incorporate an LLM-powered "interactor" role, enabling dynamic, contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
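A skeleton of such an interactive loop, with both model calls left as placeholders; this shows the general shape of an interactor-driven evaluation, not KIEval's actual implementation:

```python
def evaluate_interactively(candidate, interactor, seed_question, turns=3):
    """Alternate between the model under test and an interactor that
    generates follow-ups conditioned on the transcript so far, so a
    memorized benchmark answer cannot carry the whole conversation."""
    transcript = [("interactor", seed_question)]
    question = seed_question
    for _ in range(turns):
        answer = candidate(question)           # model under evaluation
        transcript.append(("candidate", answer))
        question = interactor(transcript)      # adaptive follow-up
        transcript.append(("interactor", question))
    return transcript

demo = evaluate_interactively(
    candidate=lambda q: f"(answer to: {q})",
    interactor=lambda t: f"follow-up #{len(t) // 2}",
    seed_question="What causes anemia?")
for role, text in demo:
    print(f"{role}: {text}")
```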
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI [38.497288024393065]
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process.
This paper explores state-of-the-art evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare.
arXiv Detail & Related papers (2023-09-21T19:36:48Z)