Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper
- URL: http://arxiv.org/abs/2507.14615v1
- Date: Sat, 19 Jul 2025 13:25:26 GMT
- Title: Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper
- Authors: Fred Mutisya, Shikoh Gitau, Christine Syovata, Diana Oigara, Ibrahim Matende, Muna Aden, Munira Ali, Ryan Nyotu, Diana Marion, Job Nyangena, Nasubo Ongoma, Keith Mbae, Elizabeth Wamicha, Eric Mibuari, Jean Philbert Nsengemana, Talkmore Chidede,
- Abstract summary: Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval-augmented generation (RAG) to ground clinical questions in Kenya's national guidelines, ensuring alignment with local standards.
- Score: 0.609562679184219
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval-augmented generation (RAG) to ground clinical questions in Kenya's national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale-based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset includes thousands of regulator-aligned question-answer pairs across common outpatient conditions. Beyond accuracy, we introduce evaluation metrics that test clinical reasoning, safety, and adaptability, such as rare case detection (Needle in the Haystack), stepwise logic (Decision Points), and contextual adaptability. Initial results reveal significant performance gaps when LLMs are applied to localized scenarios, consistent with findings that LLM accuracy is lower on African medical content than on US-based benchmarks. This work offers a replicable model for guideline-driven, dynamic benchmarking to support safe AI deployment in African health systems.
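The generation pipeline described in the abstract (digitize and chunk the guidelines, index the chunks for semantic retrieval, then prompt an LLM with retrieved excerpts to draft bilingual questions) can be illustrated with a minimal sketch. The embedding model, chunk size, top-k value, and the `call_llm` placeholder below are illustrative assumptions, not the authors' implementation; the paper itself uses Gemini Flash 2.0 Lite over Kenya's national guidelines.

```python
"""Minimal sketch of a guideline-grounded question-generation loop:
chunk the digitized guideline text, embed and index it, retrieve the
most relevant excerpts for a clinical topic, and prompt an LLM to draft
a bilingual multiple-choice question with a rationale. All parameter
choices here are illustrative, not taken from the paper."""
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def chunk_guidelines(text: str, max_words: int = 200) -> list[str]:
    """Split digitized guideline text into roughly fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(chunks: list[str]) -> np.ndarray:
    """Embed every chunk once; normalized vectors so dot product = cosine similarity."""
    return embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    """Return the k guideline chunks most similar to the clinical query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[::-1][:k]
    return [chunks[i] for i in top]

def call_llm(prompt: str) -> str:
    """Placeholder for the generation model (the paper used Gemini Flash 2.0 Lite)."""
    raise NotImplementedError("plug in your LLM client here")

def generate_mcq(topic: str, chunks: list[str], index: np.ndarray) -> str:
    """Ground an MCQ-generation prompt in the retrieved guideline excerpts."""
    excerpts = "\n\n".join(retrieve(topic, chunks, index))
    prompt = (
        "Using ONLY the Kenyan guideline excerpts below, write a realistic "
        "Level 2/3 outpatient scenario, one multiple-choice question with "
        "four options, the correct answer, and a short rationale, in both "
        f"English and Swahili.\n\nTopic: {topic}\n\nExcerpts:\n{excerpts}"
    )
    return call_llm(prompt)
```

In the paper's workflow the drafted items then go through physician co-creation and a blinded expert review; the sketch covers only the automated drafting step.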
Related papers
- Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench [0.0]
HealthBench is a benchmark designed to better measure the capabilities of AI systems for health. Its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies. We propose anchoring reward functions in version-controlled Clinical Practice Guidelines that incorporate systematic reviews and GRADE evidence ratings.
arXiv Detail & Related papers (2025-07-31T18:16:10Z) - Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens [0.609562679184219]
Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high-income settings. Alama Health QA was developed using a retrieval-augmented generation framework anchored on the Kenyan Clinical Practice Guidelines. Alama scored highest for relevance and guideline alignment; PubMedQA scored lowest for clinical utility.
arXiv Detail & Related papers (2025-07-22T08:05:30Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, an evaluation framework specifically designed for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge. We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models. We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build ClinicBench, a benchmark for better understanding large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Zero-Shot Clinical Trial Patient Matching with LLMs [40.31971412825736]
Large language models (LLMs) offer a promising solution to automated screening.
We design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria.
Our system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark.
arXiv Detail & Related papers (2024-02-05T00:06:08Z) - Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)