Related papers: Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health

URL: http://arxiv.org/abs/2511.17554v2
Date: Wed, 26 Nov 2025 20:20:13 GMT
Title: Beyond the Rubric: Cultural Misalignment in LLM Benchmarks for Sexual and Reproductive Health
Authors: Sumon Kanti Dey, Manvi S, Zeel Mehta, Meet Shah, Unnati Agrawal, Suhani Jalota, Azra Ismail
Abstract summary: Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We extracted 637 SRH queries from the dataset and evaluated on the 330 single-turn conversations. Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts.
Score: 4.811306010183038
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a chatbot for sexual and reproductive health (SRH) for an underserved community in India. We evaluated using HealthBench, a benchmark for conversational health models by OpenAI. We extracted 637 SRH queries from the dataset and evaluated on the 330 single-turn conversations. Responses were evaluated using HealthBench's rubric-based automated grader, which rated responses consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were actually culturally appropriate and medically accurate. We highlight recurring issues, particularly a Western bias, such as for legal framing and norms (e.g., breastfeeding in public), diet assumptions (e.g., fish safe to eat during pregnancy), and costs (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that meet quality standards while recognizing needs of diverse populations.
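To make concrete how a rubric-based grader can rate a contextually appropriate response low, here is a minimal sketch in Python. It assumes a HealthBench-style scheme in which each criterion carries positive or negative points and the score is the earned points divided by the maximum achievable positive points, clipped to [0, 1]. The `Criterion` class, the `rubric_score` function, and the example criteria are hypothetical illustrations, not the paper's data or the benchmark's actual code.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # positive = desirable behavior, negative = undesirable behavior
    met: bool    # the grader's judgment for this response


def rubric_score(criteria):
    """Earned points over the maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric: the response is accurate and locally appropriate,
# but it does not satisfy a criterion that presumes an insurance-based system.
example = [
    Criterion("Gives medically accurate information", 5, True),
    Criterion("Recommends consulting a health worker", 3, True),
    Criterion("Explains insurance coverage options", 4, False),
    Criterion("Gives unsafe advice", -6, False),
]

score = rubric_score(example)
print(round(score, 2))
```

In this sketch the unmet insurance criterion alone removes a third of the achievable points, even though a reviewer familiar with the local healthcare context might consider that criterion irrelevant.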

Related papers

From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas [1.8594711725515678]
We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We propose a large language model (LLM) assisted construction and quality control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale.
arXiv Detail & Related papers (2026-01-31T03:29:30Z)
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure Over-Refusal and Safe Completion quality in healthcare. Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z)
Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system [5.7880565661958565]
This study investigates the applicability of HealthBench to the Japanese context. Resources in Japanese are scarce and often consist of translated multiple-choice questions.
arXiv Detail & Related papers (2025-09-22T07:36:12Z)
MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering [11.575146661047368]
We introduce MORQA, a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain.
arXiv Detail & Related papers (2025-09-15T19:51:57Z)
Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench [0.0]
HealthBench is a benchmark designed to better measure the capabilities of AI systems for health. Its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies. We propose anchoring reward functions in version-controlled Clinical Practice Guidelines that incorporate systematic reviews and GRADE evidence ratings.
arXiv Detail & Related papers (2025-07-31T18:16:10Z)
A Scalable Framework for Evaluating Health Language Models [16.253655494186905]
Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Current evaluation practices for open-ended text responses heavily rely on human experts. This work introduces Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions.
arXiv Detail & Related papers (2025-03-30T06:47:57Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs). We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics. We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications [85.24952708195582]
This study examines the goals, community practices, assumptions, and constraints that shape NLG evaluations. We examine their implications and how they embody ethical considerations.
arXiv Detail & Related papers (2022-05-13T18:00:11Z)
GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
