Toward Human-Centered Readability Evaluation
- URL: http://arxiv.org/abs/2510.10801v1
- Date: Sun, 12 Oct 2025 20:38:32 GMT
- Title: Toward Human-Centered Readability Evaluation
- Authors: Bahar İlgen, Georges Hattab
- Abstract summary: The Human-Centered Readability Score (HCRS) is a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users' needs, expectations, and lived experiences.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP), such as BLEU, FKGL, and SARI, mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users' needs, expectations, and lived experiences.
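The abstract describes HCRS only at the framework level; no scoring formula is given here. As a rough illustration of how automatic measures might be blended with structured human feedback over the five dimensions named above (clarity, trustworthiness, tone, cultural relevance, actionability), here is a minimal Python sketch. The weights, the 1-5 Likert scale, and the metric normalizations are placeholder assumptions, not details from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical HCRS-style aggregation: the paper does not specify weights,
# scales, or normalization; everything below is an illustrative assumption.

DIMENSIONS = ["clarity", "trustworthiness", "tone", "cultural_relevance", "actionability"]

@dataclass
class HumanFeedback:
    """Structured ratings from one reader, 1-5 Likert scale per dimension."""
    ratings: dict  # e.g. {"clarity": 4, "trustworthiness": 5, ...}

def normalize_fkgl(fkgl: float, target_grade: float = 6.0, span: float = 6.0) -> float:
    """Map an FKGL grade level to [0, 1], best at or below the target grade.
    Target grade and span are illustrative choices, not from the paper."""
    return max(0.0, min(1.0, 1.0 - (fkgl - target_grade) / span))

def normalize_sari(sari: float) -> float:
    """SARI is reported on a 0-100 scale; rescale it to [0, 1]."""
    return max(0.0, min(1.0, sari / 100.0))

def hcrs_score(fkgl: float, sari: float, feedback: list[HumanFeedback],
               auto_weight: float = 0.4) -> float:
    """Blend normalized automatic metrics with mean human ratings across
    the five dimensions. The 0.4 / 0.6 split is a placeholder."""
    auto = mean([normalize_fkgl(fkgl), normalize_sari(sari)])
    human = mean(
        mean(fb.ratings[d] for fb in feedback) / 5.0  # Likert 1-5 -> [0, 1]
        for d in DIMENSIONS
    )
    return auto_weight * auto + (1.0 - auto_weight) * human

# Example: two readers rate one simplified health text.
readers = [
    HumanFeedback({"clarity": 4, "trustworthiness": 5, "tone": 4,
                   "cultural_relevance": 3, "actionability": 4}),
    HumanFeedback({"clarity": 5, "trustworthiness": 4, "tone": 4,
                   "cultural_relevance": 4, "actionability": 5}),
]
print(round(hcrs_score(fkgl=6.5, sari=42.0, feedback=readers), 3))
```

Since the paper presents a protocol for empirical validation and integration into participatory evaluation workflows, any real weighting scheme would presumably be derived there rather than fixed as in this sketch.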
Related papers
- Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs [0.0]
We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition. We fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets. Human evaluation shows that over 80% of generated summaries preserve critical medical information.
arXiv Detail & Related papers (2025-11-13T19:42:11Z) - A Scalable Framework for Evaluating Health Language Models [16.253655494186905]
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Current evaluation practices for open-ended text responses rely heavily on human experts. This work introduces Adaptive Precise Boolean rubrics, an evaluation framework that streamlines human and automated evaluation of open-ended questions (a minimal illustrative sketch of a boolean-rubric check follows this list).
arXiv Detail & Related papers (2025-03-30T06:47:57Z) - A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause [7.156867036177255]
The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention. We examine the performance of publicly available LLM-based chatbots for menopause-related queries. Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics.
arXiv Detail & Related papers (2025-02-05T19:56:52Z) - Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools [13.386012271835039]
We created an evaluation framework with 100 benchmark questions and ideal responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot.
arXiv Detail & Related papers (2024-08-03T19:57:49Z) - Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - AntEval: Evaluation of Social Interaction Competencies in LLM-Driven
Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z) - Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI [38.497288024393065]
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process.
This paper explores state-of-the-art evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare.
arXiv Detail & Related papers (2023-09-21T19:36:48Z) - Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) that fits human preference according to quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
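Several of the related papers above turn open-ended evaluation into discrete checks; for example, "A Scalable Framework for Evaluating Health Language Models" introduces Adaptive Precise Boolean rubrics (the sketch referenced in that entry appears here). The listing gives no implementation details, so the following only illustrates the general shape of a boolean-rubric pass: hypothetical yes/no rubric items scored against a free-text health answer, with keyword matching standing in for an expert or LLM grader.

```python
from dataclasses import dataclass

# Hypothetical boolean-rubric check: the rubric items, keyword rules, and
# scoring are illustrative assumptions, not details from the paper.

@dataclass
class RubricItem:
    question: str    # yes/no criterion the answer must satisfy
    keywords: tuple  # crude stand-in for an expert or LLM judgment

def grade(answer: str, rubric: list[RubricItem]) -> tuple[dict, float]:
    """Return per-item booleans and the fraction of items satisfied."""
    text = answer.lower()
    results = {item.question: any(k in text for k in item.keywords) for item in rubric}
    return results, sum(results.values()) / len(rubric)

rubric = [
    RubricItem("Mentions consulting a clinician?", ("doctor", "clinician", "provider")),
    RubricItem("States a concrete next step?", ("schedule", "call", "take")),
    RubricItem("Avoids an absolute guarantee?", ("may", "can", "might")),
]
answer = "You may have a mild infection; call your doctor to schedule a visit."
per_item, score = grade(answer, rubric)
print(per_item, score)
```

A real rubric would be authored and validated by domain experts, and per-item judgments would typically come from trained raters or a calibrated model rather than string matching.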