MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration
- URL: http://arxiv.org/abs/2601.23049v1
- Date: Fri, 30 Jan 2026 14:56:20 GMT
- Title: MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration
- Authors: Yakun Zhu, Yutong Huang, Shengqian Qin, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang,
- Abstract summary: We introduce MedMCP-Calc, the first benchmark for evaluating medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured database interaction, external reference retrieval, and process-level evaluation. We develop CalcMate, a fine-tuned model incorporating scenario planning and tool augmentation, achieving state-of-the-art performance among open-source models.
- Score: 17.39421062613435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical calculators are fundamental to quantitative, evidence-based clinical practice. However, their real-world use is an adaptive, multi-stage process, requiring proactive EHR data acquisition, scenario-dependent calculator selection, and multi-step computation, whereas current benchmarks focus only on static single-step calculations with explicit instructions. To address these limitations, we introduce MedMCP-Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured EHR database interaction, external reference retrieval, and process-level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit substantial gaps, including difficulty selecting appropriate calculators for end-to-end workflows given fuzzy queries, poor performance in iterative SQL-based database interactions, and marked reluctance to leverage external tools for numerical computation. Performance also varies considerably across clinical domains. Building on these findings, we develop CalcMate, a fine-tuned model incorporating scenario planning and tool augmentation, achieving state-of-the-art performance among open-source models. The benchmark and code are available at https://github.com/SPIRAL-MED/MedMCP-Calc.
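The tool side of such a setup is straightforward to picture. Below is a minimal sketch, assuming the official `mcp` Python SDK's FastMCP server, of how a single medical calculator (Cockcroft-Gault creatinine clearance, chosen purely for illustration) could be exposed to an MCP-capable agent; the server name and tool signature are assumptions, not MedMCP-Calc's actual harness.

```python
# Hedged sketch: one medical calculator exposed as an MCP tool via the
# official `mcp` Python SDK; not the MedMCP-Calc benchmark harness itself.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("medical-calculators")  # server name is illustrative

@mcp.tool()
def creatinine_clearance(age: int, weight_kg: float,
                         serum_creatinine_mg_dl: float,
                         female: bool) -> float:
    """Cockcroft-Gault creatinine clearance in mL/min."""
    crcl = ((140 - age) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an agent can discover and call the tool
```

An agent graded at the process level would be expected to call such a tool with arguments it first retrieved from the EHR, rather than doing the arithmetic in-context.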
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
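As a rough illustration of the parameter-efficient federated idea, the sketch below averages only small adapter weights (e.g., LoRA matrices) across institutions, weighted by local sample counts; the function and data layout are assumptions, not the paper's actual framework.

```python
import numpy as np

def fedavg_adapters(client_adapters, client_sizes):
    """Weighted FedAvg restricted to adapter parameters.

    client_adapters: list of {param_name: np.ndarray} dicts (e.g., LoRA
    matrices); the frozen base LLM never leaves each institution.
    """
    total = sum(client_sizes)
    return {
        name: sum(w[name] * (n / total)
                  for w, n in zip(client_adapters, client_sizes))
        for name in client_adapters[0]
    }
```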
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
- MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models [12.35019345259966]
We introduce MedCalc-Eval, the largest benchmark for assessing large language models' medical calculation abilities. Its tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. We further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning.
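A multi-step calculation environment of this kind can be sketched with a generic gym-style reset/step interface, as below; InternBootcamp's actual API and MedCalc-Env's reward scheme are not given in the abstract, so every detail here is an assumption.

```python
class CalcEnvSketch:
    """Gym-style sketch of a multi-step medical-calculation episode."""

    def __init__(self, case_text: str, gold_answer: float, max_steps: int = 8):
        self.case_text, self.gold, self.max_steps = case_text, gold_answer, max_steps

    def reset(self):
        self.steps = 0
        return self.case_text  # initial observation: the clinical vignette

    def step(self, action: str):
        # Actions are free-text reasoning or tool calls; "ANSWER: x" ends
        # the episode with reward 1.0 if x is numerically close to gold.
        self.steps += 1
        done = action.startswith("ANSWER:") or self.steps >= self.max_steps
        reward = 0.0
        if action.startswith("ANSWER:"):
            try:
                reward = float(abs(float(action[7:]) - self.gold) < 1e-2)
            except ValueError:
                pass
        return "", reward, done  # intermediate observations elided
```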
arXiv Detail & Related papers (2025-10-31T08:07:16Z)
- From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations [45.414878840652115]
Large language models (LLMs) have demonstrated promising performance on medical benchmarks. However, their ability to perform medical calculations remains underexplored and poorly evaluated. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness.
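Step-level grading of the kind this line of work argues for can be sketched as scoring named intermediate quantities rather than only the final value; the step schema below is an illustrative assumption, not the paper's protocol.

```python
def step_accuracy(pred_steps: dict, gold_steps: dict, tol: float = 0.01) -> float:
    """Fraction of gold intermediate steps the model got (approximately) right.

    Both dicts map step names (e.g., "bmi", "crcl") to numeric values; the
    naming scheme is hypothetical.
    """
    hits = sum(
        name in pred_steps
        and abs(pred_steps[name] - gold) <= tol * max(abs(gold), 1.0)
        for name, gold in gold_steps.items()
    )
    return hits / max(len(gold_steps), 1)
```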
arXiv Detail & Related papers (2025-09-20T09:10:26Z)
- EHR-MCP: Real-world Evaluation of Clinical Information Retrieval by Large Language Models via Model Context Protocol [0.0]
Large language models (LLMs) show promise in medicine, but their deployment in hospitals is limited by restricted access to electronic health record (EHR) systems. The Model Context Protocol (MCP) enables integration between LLMs and external tools. We developed EHR-MCP, a framework of custom MCP tools integrated with the hospital EHR database, and used GPT-4.1 through a LangGraph ReAct agent to interact with it.
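One of the simplest such tools to imagine is a guarded, read-only SQL query against the record store. The sketch below uses a local SQLite database as a stand-in for the hospital EHR; the table in the usage comment and the guard logic are assumptions, since EHR-MCP's tools are custom to the deployment.

```python
import sqlite3

def query_ehr(sql: str, db_path: str = "ehr.sqlite") -> list[dict]:
    """Read-only EHR query tool (hypothetical; not EHR-MCP's actual tool)."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("read-only tool: SELECT statements only")
    with sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) as conn:
        conn.row_factory = sqlite3.Row
        return [dict(row) for row in conn.execute(sql)]

# e.g. query_ehr("SELECT value FROM labs WHERE patient_id = 1 "
#                "AND name = 'creatinine' ORDER BY taken_at DESC LIMIT 1")
```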
arXiv Detail & Related papers (2025-09-19T13:17:16Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
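A zero-shot cue of this kind might be assembled per rating-scale item as below; the template wording is hypothetical and in the spirit of the paper, not its actual prompt.

```python
MADRS_ITEM_PROMPT = """You are a clinician rating a depression interview.
Item: {item} (score 0-6; anchors: {anchors})
Transcript:
{transcript}
Reply with a single integer from 0 to 6, then a one-sentence rationale."""

def build_madrs_prompt(item: str, anchors: str, transcript: str) -> str:
    # Hypothetical zero-shot cue; LlaMADRS's exact wording may differ.
    return MADRS_ITEM_PROMPT.format(item=item, anchors=anchors,
                                    transcript=transcript)
```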
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- MeNTi: Bridging Medical Calculator and LLM Agent with Nested Tool Calling [12.236137157144965]
We introduce MeNTi, a universal agent architecture for Large Language Models (LLMs). MeNTi integrates a specialized medical toolkit and employs meta-tool and nested calling mechanisms to enhance LLM tool utilization. To assess the quantitative-assessment capabilities of LLMs across clinical calculator scenarios, we introduce CalcQA.
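Nested tool calling can be sketched as a meta-tool that first selects a calculator and then dispatches to it with filled-in arguments; in MeNTi the LLM itself performs the selection and slot filling, so the keyword-based selector below is only a stand-in.

```python
# Hypothetical calculator registry; MeNTi's toolkit is far larger.
CALCULATORS = {
    "bmi": lambda weight_kg, height_m: weight_kg / height_m ** 2,
    "map": lambda sbp, dbp: (sbp + 2 * dbp) / 3,  # mean arterial pressure
}

def meta_tool(task: str, **kwargs) -> float:
    # Stand-in selector; the real system asks the LLM to choose the nested
    # tool and map extracted clinical values onto its parameters.
    name = "bmi" if "body mass" in task.lower() else "map"
    return CALCULATORS[name](**kwargs)

# e.g. meta_tool("compute the body mass index", weight_kg=70, height_m=1.75)
```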
arXiv Detail & Related papers (2024-10-17T14:46:22Z)
- MedCalc-Bench: Evaluating Large Language Models for Medical Calculations [18.8552481902506]
Current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning.
We propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs.
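Grading such a dataset ultimately reduces to a tolerance check on a number; the relative tolerance below is an illustrative assumption, as MedCalc-Bench's official grading rules may differ.

```python
import math

def calc_correct(pred: float, gold: float, rel_tol: float = 0.05) -> bool:
    # Hypothetical grading rule: accept answers within 5% of the reference.
    return math.isclose(pred, gold, rel_tol=rel_tol)
```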
arXiv Detail & Related papers (2024-06-17T19:07:21Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
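The extraction step amounts to prompting for JSON against a fixed schema and validating the response; the schema and helper below are hypothetical, and TEMED-LLM's downstream interpretable classifier is not shown.

```python
import json

# Hypothetical target schema: field name -> expected Python type.
FIELDS = {"age": int, "sex": str, "hemoglobin_g_dl": float}

def parse_extraction(llm_output: str) -> dict:
    """Coerce an LLM's JSON extraction of report fields to typed values."""
    record = json.loads(llm_output)
    return {k: t(record[k]) for k, t in FIELDS.items() if k in record}

# e.g. parse_extraction('{"age": "63", "sex": "F", "hemoglobin_g_dl": "9.8"}')
#      -> {"age": 63, "sex": "F", "hemoglobin_g_dl": 9.8}
```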
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Time Associated Meta Learning for Clinical Prediction [78.99422473394029]
We propose a novel time associated meta learning (TAML) method to make effective predictions at multiple future time points.
To address the sparsity problem after task splitting, TAML employs a temporal information sharing strategy to augment the number of positive samples.
We demonstrate the effectiveness of TAML on multiple clinical datasets, where it consistently outperforms a range of strong baselines.
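One generic reading of that sharing strategy, for cumulative outcomes, is that an event occurring by time t is also an event by every later horizon, so a single positive patient contributes positives to several of the split tasks; the rule below encodes that reading and is not TAML's exact mechanism.

```python
def share_positives(event_times: list, horizons: list) -> dict:
    """Label one patient at multiple future time points (hypothetical rule)."""
    first_event = min(event_times) if event_times else float("inf")
    return {h: int(first_event <= h) for h in horizons}

# share_positives(event_times=[30], horizons=[7, 30, 90]) -> {7: 0, 30: 1, 90: 1}
```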
arXiv Detail & Related papers (2023-03-05T03:54:54Z)
- Competence-based Multimodal Curriculum Learning for Medical Report Generation [98.10763792453925]
We propose a Competence-based Multimodal Curriculum Learning framework (CMCL) to alleviate data bias and make the best use of available data.
Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner.
Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
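Competence-based curricula are typically implemented by growing a competence threshold over training and admitting only samples whose difficulty falls below it; the square-root schedule below is the standard generic form and an assumption about CMCL's exact choice.

```python
import math

def competence(step: int, total_steps: int, c0: float = 0.1) -> float:
    # Generic square-root competence schedule; CMCL's schedule may differ.
    return min(1.0, math.sqrt(step * (1 - c0 ** 2) / total_steps + c0 ** 2))

def curriculum_batch(samples, difficulty, step, total_steps):
    """Keep only samples within the model's current competence."""
    c = competence(step, total_steps)
    return [s for s in samples if difficulty(s) <= c]
```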
arXiv Detail & Related papers (2022-06-24T08:16:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.