Automatic Replication of LLM Mistakes in Medical Conversations
- URL: http://arxiv.org/abs/2512.20983v1
- Date: Wed, 24 Dec 2025 06:17:21 GMT
- Title: Automatic Replication of LLM Mistakes in Medical Conversations
- Authors: Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu
- Abstract summary: We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges. We found that GPT models, Claude, and Grok obtained the best performance on MedMistake-Bench.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex conversational data between an LLM patient and an LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions, and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude, and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench) and the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.
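Since the dataset is released on Hugging Face, a natural way to try it is through the `datasets` library. The sketch below is illustrative only: the repository id comes from the abstract, but the split name, the column names (`question`, `reference_answer`), the judge prompt, and the use of `gpt-4o` as both candidate and judge are assumptions rather than details taken from the paper (which uses a committee of two judges).

```python
# Minimal sketch: score a candidate model on MedMistake-style QA pairs with a
# single LLM judge. Column names, split name, and the judge rubric are
# assumptions; adjust them to the actual dataset schema.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Repository id from the abstract; "train" split is an assumption.
ds = load_dataset("TheLumos/MedicalMistakeBenchmark", split="train")

correct = 0
sample = ds.select(range(20))  # small sample to keep the demo cheap
for row in sample:
    question = row["question"]           # assumed column name
    reference = row["reference_answer"]  # assumed column name

    candidate_answer = ask("gpt-4o", question)

    # One judge here for brevity; the paper's pipeline uses two LLM judges.
    verdict = ask(
        "gpt-4o",
        "You are grading a medical QA answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly CORRECT or INCORRECT.",
    )
    correct += verdict.strip().upper().startswith("CORRECT")

print(f"judged correct: {correct}/{len(sample)}")
```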
Related papers
- Benchmarking Motivational Interviewing Competence of Large Language Models [3.640688858400333]
Motivational interviewing (MI) promotes behavioural change in substance use disorders. Its fidelity is measured using the Motivational Interviewing Treatment Integrity (MITI) framework. We shortlisted 3 proprietary and 7 open-source LLMs from LMArena and evaluated their performance using the MITI 4.2 framework on two datasets. We conducted a distinguishability experiment with two independent psychiatrists to identify human-vs-LLM responses.
arXiv Detail & Related papers (2026-03-04T08:56:37Z) - When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation [18.338933046286257]
Large language models (LLMs) are increasingly employed to address diverse problems, including medical queries. LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. This paper focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions.
arXiv Detail & Related papers (2026-02-27T21:09:43Z) - MedPI: Evaluating AI Systems in Medical Patient-facing Interactions [0.0]
We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. MedPI evaluates medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes, and doctor-patient communication. We evaluate 9 flagship models -- Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 -- across 366 AI Patients and 7,097 conversations.
arXiv Detail & Related papers (2025-12-02T19:10:06Z) - Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results [10.858989372235657]
We develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs.
arXiv Detail & Related papers (2025-11-04T04:20:33Z) - MedVAL: Toward Expert-Level Medical Text Validation with Language Models [19.885282576644077]
There is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluators.
arXiv Detail & Related papers (2025-07-03T20:19:18Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - DeepCritic: Deliberate Critique with Large Language Models [77.5516314477878]
We focus on studying and enhancing the math critique ability of Large Language Models (LLMs). Our critique model, built on Qwen2.5-7B-Instruct, significantly outperforms existing LLM critics on various error identification benchmarks.
arXiv Detail & Related papers (2025-05-01T17:03:17Z) - GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks [0.11458853556386796]
This paper establishes a benchmark for evaluating the tool-calling capabilities of large language models (LLMs). We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1, and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI's GPT-4.1, GPT-4o, and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in...
arXiv Detail & Related papers (2025-03-23T16:20:14Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop.
Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances.
We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z) - A Continued Pretrained LLM Approach for Automatic Medical Note Generation [10.981182525560751]
We introduce HEAL, the first continuously trained 13B LLaMA2-based LLM that is purpose-built for medical conversations and measured on automated scribing.
Our results demonstrate that HEAL outperforms GPT-4 and PMC-LLaMA in PubMedQA, with an accuracy of 78.4%.
Remarkably, HEAL surpasses GPT-4 and Med-PaLM 2 in identifying more correct medical concepts and exceeds the performance of human scribes in correctness and completeness.
arXiv Detail & Related papers (2024-03-14T02:55:37Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)