AgentScore: Autoformulation of Deployable Clinical Scoring Systems
- URL: http://arxiv.org/abs/2601.22324v1
- Date: Thu, 29 Jan 2026 21:11:06 GMT
- Title: AgentScore: Autoformulation of Deployable Clinical Scoring Systems
- Authors: Silas Ruhrberg Estévez, Christopher Chiu, Mihaela van der Schaar
- Abstract summary: We introduce AgentScore, which performs semantically guided optimization in unit-weighted clinical checklists. AgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
- Score: 45.88028371034407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
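The abstract describes a unit-weighted clinical checklist as a threshold applied to the sum of binary rules. A minimal Python sketch of that model class follows. The rule definitions, feature names (`age`, `systolic_bp`, `resp_rate`), cutoffs, and threshold are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the sketch covers only scoring, not AgentScore's rule proposal or verification loop.

```python
from typing import Callable, Dict, List

Patient = Dict[str, float]
Rule = Callable[[Patient], bool]

# Hypothetical binary rules (illustrative only, not from the paper).
RULES: List[Rule] = [
    lambda p: p["age"] >= 65,          # rule 1: older age
    lambda p: p["systolic_bp"] < 90,   # rule 2: low systolic blood pressure
    lambda p: p["resp_rate"] >= 22,    # rule 3: elevated respiratory rate
]


def checklist_score(patient: Patient, rules: List[Rule]) -> int:
    """Unit-weighted score: each satisfied rule contributes exactly 1 point."""
    return sum(1 for rule in rules if rule(patient))


def checklist_predict(patient: Patient, rules: List[Rule], threshold: int) -> bool:
    """Flag the patient when the number of satisfied rules reaches the threshold."""
    return checklist_score(patient, rules) >= threshold


# Hypothetical patient record: satisfies rules 1 and 2 but not rule 3.
patient = {"age": 72, "systolic_bp": 85, "resp_rate": 18}
score = checklist_score(patient, RULES)
flagged = checklist_predict(patient, RULES, threshold=2)
```

Because every rule carries the same weight, the score can be computed by counting, which is what makes this form easy to memorize and audit. The search difficulty noted in the abstract comes from choosing which rules to include: with m candidate rules there are 2^m possible rule sets.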
Related papers
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration. GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals. We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z)
- Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning [51.99383151474742]
We propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We show that our method consistently outperforms existing active learning methods under the same annotation budget.
arXiv Detail & Related papers (2026-02-04T09:01:55Z)
- POET: Protocol Optimization via Eligibility Tuning [1.4267510278572033]
We propose a guided generation framework that introduces interpretable semantic axes to steer EC generation. These axes offer a middle ground between specificity and usability, enabling clinicians to guide generation without specifying exact entities. Our results show that our guided generation approach consistently outperforms unguided generation in both automatic, rubric-based and clinician evaluations.
arXiv Detail & Related papers (2026-01-30T22:32:43Z)
- Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight [5.202988483354374]
In this work, we propose viewing benchmarks for complex tasks as "in-progress living documents" that should be periodically re-evaluated. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity.
arXiv Detail & Related papers (2025-12-22T18:59:34Z)
- Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning [53.9713678229744]
Multi-instance partial-label learning (MIPL) is a weakly supervised framework that addresses the challenges of inexact supervision in both instance and label spaces. Existing MIPL approaches often suffer from poor calibration, undermining reliability. We propose a plug-and-play calibratable disambiguation loss (CDL) that simultaneously improves classification accuracy and calibration performance.
arXiv Detail & Related papers (2025-12-19T16:58:31Z)
- Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect [8.16102315566872]
The coexistence of multiple models with comparable performance poses fundamental challenges for trustworthy deployment and evaluation. We propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets.
arXiv Detail & Related papers (2025-11-18T10:21:07Z)
- Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z)
- Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation [3.952186976672079]
We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component.
arXiv Detail & Related papers (2025-10-08T23:50:58Z)
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
- Q-Learning with Clustered-SMART (cSMART) Data: Examining Moderators in the Construction of Clustered Adaptive Interventions [3.9650359172757743]
A clustered adaptive intervention (cAI) is a sequence of decision rules that guides practitioners on how best to tailor cluster-level intervention to improve outcomes. We introduce a clustered Q-learning framework with the M-out-of-N Cluster Bootstrap to evaluate whether a set of candidate tailoring variables may be useful in defining an optimal cAI.
arXiv Detail & Related papers (2025-05-01T19:24:39Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.