Robustness tests for biomedical foundation models should tailor to specification
- URL: http://arxiv.org/abs/2502.10374v1
- Date: Fri, 14 Feb 2025 18:52:10 GMT
- Title: Robustness tests for biomedical foundation models should tailor to specification
- Authors: R. Patrick Xian, Noah R. Baker, Tom David, Qiming Cui, A. Jay Holmgren, Stefan Bauer, Madhumita Sushil, Reza Abbasi-Asl
- Abstract summary: We suggest a priority-based, task-oriented approach to tailor robustness evaluation objectives to a predefined specification. We urge concrete policies to adopt a granular categorization of robustness concepts in the specification.
- Score: 16.66048720047442
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing regulatory frameworks for biomedical AI include robustness as a key component but lack detailed implementational guidance. The recent rise of biomedical foundation models creates new hurdles in testing and certification given their broad capabilities and susceptibility to complex distribution shifts. To balance test feasibility and effectiveness, we suggest a priority-based, task-oriented approach to tailor robustness evaluation objectives to a predefined specification. We urge concrete policies to adopt a granular categorization of robustness concepts in the specification. Our approach promotes the standardization of risk assessment and monitoring, which guides technical developments and mitigation efforts.
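The paper's central proposal, a granular, priority-ordered robustness specification, can be pictured as a concrete data structure. Below is a minimal sketch; the shift categories, task names, and thresholds are hypothetical illustrations, not the authors' taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class ShiftType(Enum):
    """Illustrative granular robustness categories (hypothetical, not the paper's taxonomy)."""
    TYPO_PERTURBATION = "typo"          # character-level input noise
    SYNONYM_SUBSTITUTION = "synonym"    # meaning-preserving rewording
    ACQUISITION_SHIFT = "acquisition"   # e.g., a new scanner or lab protocol
    POPULATION_SHIFT = "population"     # demographic distribution change

@dataclass
class RobustnessTest:
    task: str          # the downstream task the test is tied to
    shift: ShiftType   # which robustness concept the test targets
    priority: int      # 1 = highest; drives ordering under a fixed test budget
    min_score: float   # pass/fail threshold on the task metric

# A priority-based, task-oriented specification: run tests in priority order
# until the evaluation budget runs out.
spec = [
    RobustnessTest("radiology-report-summarization", ShiftType.POPULATION_SHIFT, 2, 0.80),
    RobustnessTest("radiology-report-summarization", ShiftType.TYPO_PERTURBATION, 1, 0.85),
]
for test in sorted(spec, key=lambda t: t.priority):
    print(f"run {test.task} under '{test.shift.value}' shift, require >= {test.min_score}")
```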
Related papers
- Score-based Generative Modeling for Conditional Independence Testing [35.0533359302886]
We propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power.
We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests.
arXiv Detail & Related papers (2025-05-29T10:10:46Z)
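Simulation-based CI tests of this kind share a common skeleton: draw fresh copies of X from a learned model of X given Z, recompute the test statistic, and compare. A minimal sketch follows, with a generic conditional sampler standing in for the paper's score-based generative model.

```python
import numpy as np

def ci_test_pvalue(x, y, z, sample_x_given_z, n_draws=500, seed=0):
    """Conditional randomization test of X independent of Y given Z.

    `sample_x_given_z` stands in for a learned conditional generative model
    (the paper fits a score-based model); any callable that draws a fresh X
    for each entry of z works here.
    """
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])            # simple test statistic
    null_stats = np.empty(n_draws)
    for i in range(n_draws):
        x_null = sample_x_given_z(z, rng)              # resample X under H0
        null_stats[i] = abs(np.corrcoef(x_null, y)[0, 1])
    # p-value: how often the null statistic matches or beats the observed one
    return (1 + np.sum(null_stats >= observed)) / (1 + n_draws)

# Toy check where X and Y are independent given Z, so H0 holds.
rng = np.random.default_rng(1)
z = rng.normal(size=400)
x = z + rng.normal(size=400)
y = z + rng.normal(size=400)
print(ci_test_pvalue(x, y, z, lambda zz, r: zz + r.normal(size=zz.shape)))
```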
- Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains.
Existing research predominantly concentrates on the security of general large language models.
This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
- Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems [1.415098516077151]
The rise of agentic AI systems, where agents collaborate to perform diverse tasks, poses new challenges for observing, analyzing, and optimizing their behavior.
Traditional evaluation and benchmarking approaches struggle to handle the non-deterministic, context-sensitive, and dynamic nature of these systems.
This paper explores key challenges and opportunities in analyzing and optimizing agentic systems across development, testing, and maintenance.
arXiv Detail & Related papers (2025-03-09T20:02:04Z)
- SMRS: advocating a unified reporting standard for surrogate models in the artificial intelligence era [1.4835379864550937]
We argue for the urgent need to establish a structured reporting standard for surrogate models.
By promoting a standardised yet flexible framework, we aim to improve the reliability of surrogate modelling.
arXiv Detail & Related papers (2025-02-10T18:31:15Z)
- Pitfalls of topology-aware image segmentation [81.19923502845441]
We identify critical pitfalls in model evaluation that include inadequate connectivity choices, overlooked topological artifacts, and inappropriate use of evaluation metrics.
We propose a set of actionable recommendations to establish fair and robust evaluation standards for topology-aware medical image segmentation methods.
arXiv Detail & Related papers (2024-12-19T08:11:42Z)
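The connectivity pitfall in particular is easy to reproduce: component counts, and any Betti-number-style metric built on them, change with the assumed connectivity. A self-contained illustration:

```python
import numpy as np
from scipy import ndimage

# Two foreground pixels that touch only at a corner.
mask = np.array([[1, 0],
                 [0, 1]])

# 4-connectivity (scipy's 2D default): the pixels are separate components.
_, n4 = ndimage.label(mask)
# 8-connectivity (corners count): they merge into one component.
_, n8 = ndimage.label(mask, structure=np.ones((3, 3)))

print(f"4-connectivity: {n4} components; 8-connectivity: {n8} component")
# Scoring topology with a different connectivity than the ground truth
# assumes silently changes the metric.
```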
- Exposing Assumptions in AI Benchmarks through Cognitive Modelling [0.0]
Cultural AI benchmarks often rely on implicit assumptions about measured constructs, leading to vague formulations with poor validity and unclear interrelations.
We propose exposing these assumptions using explicit cognitive models formulated as Structural Equation Models.
arXiv Detail & Related papers (2024-09-25T11:55:02Z)
- Beyond One-Time Validation: A Framework for Adaptive Validation of Prognostic and Diagnostic AI-based Medical Devices [55.319842359034546]
Existing approaches often fall short in addressing the complexity of practically deploying these devices.
The presented framework emphasizes the importance of repeated validation and fine-tuning during deployment.
It is positioned within the current US and EU regulatory landscapes.
arXiv Detail & Related papers (2024-09-07T11:13:52Z)
- FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models [54.09244105445476]
This study introduces a novel knowledge injection approach, FedKIM, to scale the medical foundation model within a federated learning framework.
FedKIM leverages lightweight local models to extract healthcare knowledge from private data and integrates this knowledge into a centralized foundation model.
Our experiments across twelve tasks in seven modalities demonstrate the effectiveness of FedKIM in various settings.
arXiv Detail & Related papers (2024-08-17T15:42:29Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Assessing biomedical knowledge robustness in large language models by query-efficient sampling attacks [0.6282171844772422]
An increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications.
The recent discovery of named entities as adversarial examples in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and fine-tuned LLMs.
We developed an embedding-space attack based on power-scaled distance-weighted sampling to assess the robustness of their biomedical knowledge.
arXiv Detail & Related papers (2024-02-16T09:29:38Z)
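The sampling step named in the abstract can be sketched generically: candidate substitute entities are drawn with probabilities given by power-scaled inverse-distance weights in embedding space. The embeddings, exponent, and function below are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

def sample_substitute_entities(query_emb, entity_embs, power=4.0, k=5, seed=0):
    """Draw k candidate substitute entities, favoring near neighbors via
    power-scaled inverse-distance weights in embedding space."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(entity_embs - query_emb, axis=1)
    dists = np.where(dists == 0, np.inf, dists)  # never pick the query itself
    weights = dists ** -power                    # larger power concentrates
    probs = weights / weights.sum()              # mass on close neighbors
    return rng.choice(len(entity_embs), size=k, replace=False, p=probs)

# Toy usage with random stand-in entity embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 64))
print(sample_substitute_entities(embs[0], embs, power=4.0, k=5))
```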
- Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
arXiv Detail & Related papers (2023-10-25T05:38:38Z)
- Benchmarking Scalable Epistemic Uncertainty Quantification in Organ Segmentation [7.313010190714819]
Deep learning based methods for automatic organ segmentation have shown promise in aiding diagnosis and treatment planning.
Quantifying the uncertainty associated with model predictions is crucial in such safety-critical clinical applications.
It remains unclear which uncertainty quantification method is preferred in the medical image analysis setting.
arXiv Detail & Related papers (2023-08-15T00:09:33Z)
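One scalable baseline typically compared in such benchmarks is the deep ensemble, scored by per-voxel predictive entropy. A minimal sketch with dummy probability maps; the shapes and threshold are illustrative only.

```python
import numpy as np

def predictive_entropy(prob_maps):
    """Per-voxel predictive entropy of an ensemble.

    prob_maps: (n_members, n_classes, *spatial) softmax outputs.
    """
    mean_probs = prob_maps.mean(axis=0)  # average the ensemble members
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=0)

# Toy usage: 5 members, 3 organ classes, one 64x64 slice.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3, 64, 64))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
entropy_map = predictive_entropy(probs)
print("voxels above an (arbitrary) uncertainty threshold:", int((entropy_map > 1.0).sum()))
```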
- Geometric Deep Learning for Structure-Based Drug Design: A Survey [83.87489798671155]
Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates.
Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, have significantly propelled the field forward.
arXiv Detail & Related papers (2023-06-20T14:21:58Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
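Estimating "the characteristics and value of each test item" usually means fitting an item response model; the adaptive step then serves the most informative item for the current ability estimate. A minimal Rasch-model sketch with hypothetical item difficulties and a deliberately crude ability update:

```python
import numpy as np

def rasch_prob(ability, difficulty):
    """Rasch (1PL) model: probability of answering an item correctly."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

def next_item(ability, difficulties, asked):
    """Pick the unasked item with maximum Fisher information; for the Rasch
    model this peaks where difficulty is closest to the current ability."""
    p = rasch_prob(ability, difficulties)
    info = p * (1.0 - p)
    info[list(asked)] = -np.inf
    return int(np.argmax(info))

# Toy adaptive loop with hypothetical item difficulties.
difficulties = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ability, asked = 0.0, set()
for _ in range(3):
    item = next_item(ability, difficulties, asked)
    asked.add(item)
    correct = True                        # stand-in for the model's response
    ability += 0.5 if correct else -0.5   # crude update; real CAT uses MLE/EAP
    print(f"asked item {item}, ability estimate now {ability:+.1f}")
```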
- Benchmarking Model Predictive Control Algorithms in Building Optimization Testing Framework (BOPTEST) [40.17692290400862]
We present a data-driven modeling and control framework for physics-based building emulators.
Our approach includes offline training of differentiable surrogate models that accelerate model evaluations, provide cost-effective gradients, and maintain good predictive accuracy over the receding horizon in Model Predictive Control (MPC).
We extensively evaluate the modeling and control performance using multiple surrogate models and optimization frameworks across various test cases available in the Building Optimization Testing Framework (BOPTEST).
arXiv Detail & Related papers (2023-01-31T06:55:19Z)
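The control side of such a pipeline can be skeletonized as gradient-based receding-horizon optimization through a differentiable surrogate. Everything below (the surrogate architecture, cost, and setpoint) is a placeholder sketch, not BOPTEST-specific code.

```python
import torch

# Placeholder surrogate: maps (state, action) to the next state. In the paper's
# setting this would be trained offline against the building emulator.
surrogate = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def mpc_step(state, horizon=10, iters=50, lr=0.05):
    """One receding-horizon step: optimize the action sequence by gradient
    descent through the surrogate, then apply only the first action."""
    actions = torch.zeros(horizon, 1, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        s, cost = state, 0.0
        for t in range(horizon):
            s = surrogate(torch.cat([s, actions[t]]))
            # Hypothetical cost: track a 21 C setpoint, penalize control effort.
            cost = cost + (s[0] - 21.0) ** 2 + 0.1 * actions[t].pow(2).sum()
        cost.backward()
        opt.step()
    return actions.detach()[0]

state = torch.tensor([18.0, 0.5])  # e.g., zone temperature plus a disturbance proxy
print("first control action:", mpc_step(state))
```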
- Validation Diagnostics for SBI algorithms based on Normalizing Flows [55.41644538483948]
This work proposes easy-to-interpret validation diagnostics for multi-dimensional conditional (posterior) density estimators based on normalizing flows (NF).
It also offers theoretical guarantees based on results of local consistency.
This work should help the design of better specified models or drive the development of novel SBI-algorithms.
arXiv Detail & Related papers (2022-11-17T15:48:06Z)
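A widely used diagnostic in this family is simulation-based calibration: if the posterior estimator is well calibrated, the rank of the true parameter among posterior draws is uniform. A minimal sketch, with a closed-form posterior standing in for a trained normalizing flow:

```python
import numpy as np

def sbc_ranks(sample_prior, simulate, sample_posterior, n_sims=200, n_post=100, seed=0):
    """Simulation-based calibration: for a well-calibrated posterior estimator,
    the rank of the true parameter among posterior draws is uniform.
    `sample_posterior` stands in for the trained normalizing-flow posterior."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for i in range(n_sims):
        theta = sample_prior(rng)
        x = simulate(theta, rng)
        draws = sample_posterior(x, n_post, rng)
        ranks[i] = int(np.sum(draws < theta))  # rank of the truth among draws
    return ranks

# Toy conjugate-Gaussian case where the exact posterior N(x/2, 1/2) is known.
prior = lambda r: r.normal(0.0, 1.0)
sim = lambda th, r: th + r.normal(0.0, 1.0)
post = lambda x, n, r: r.normal(x / 2.0, np.sqrt(0.5), size=n)
print("rank histogram (should be roughly flat):",
      np.histogram(sbc_ranks(prior, sim, post), bins=10)[0])
```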
- Estimating Test Performance for AI Medical Devices under Distribution Shift with Conformal Prediction [4.395519864600419]
We consider the task of predicting the test accuracy of an arbitrary black-box model on an unlabeled target domain.
We propose a "black-box" test estimation technique based on conformal prediction and evaluate it against other methods.
arXiv Detail & Related papers (2022-07-12T19:25:21Z)
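The flavor of the approach can be sketched as follows: calibrate a conformal threshold on labeled source data, then score unlabeled target points against it. This is a rough illustration of the idea only, not the paper's exact estimator.

```python
import numpy as np

def covered_fraction(cal_probs, cal_labels, target_probs, alpha=0.1):
    """Calibrate a conformal threshold on labeled source data, then report the
    fraction of unlabeled target points whose top-class confidence clears it.
    A rough illustration of the flavor, not the paper's estimator."""
    # Nonconformity on calibration data: 1 - probability of the true class.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    q = np.quantile(scores, 1.0 - alpha)    # conformal quantile
    top_conf = target_probs.max(axis=1)     # black-box access: outputs only
    return float(np.mean(1.0 - top_conf <= q))

# Toy usage with random softmax outputs.
rng = np.random.default_rng(0)
cal = rng.dirichlet(np.ones(5), size=500)
cal_labels = cal.argmax(axis=1)             # optimistic toy labels
target = rng.dirichlet(np.ones(5), size=1000)
print("estimated covered fraction on target:", covered_fraction(cal, cal_labels, target))
```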
- Extending Process Discovery with Model Complexity Optimization and Cyclic States Identification: Application to Healthcare Processes [62.997667081978825]
The paper presents an approach to process mining that provides semi-automatic support for model optimization.
A model simplification approach is proposed, which essentially abstracts the raw model at the desired granularity.
We aim to demonstrate the capabilities of the technological solution using three datasets from different applications in the healthcare domain.
arXiv Detail & Related papers (2022-06-10T16:20:59Z)
- Calibrating Over-Parametrized Simulation Models: A Framework via Eligibility Set [3.862247454265944]
We develop a framework for constructing calibration schemes that satisfy rigorous frequentist statistical guarantees.
We demonstrate our methodology on several numerical examples, including an application to calibration of a limit order book market simulator.
arXiv Detail & Related papers (2021-05-27T00:59:29Z)
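The eligibility-set idea admits a compact sketch: rather than returning a single point estimate, keep every candidate parameter whose simulated summary statistic falls within a tolerance of the observed one. The statistic and tolerance below are hypothetical.

```python
import numpy as np

def eligibility_set(candidates, simulate, observed_stat, tol, seed=0):
    """Keep every candidate parameter whose simulated summary statistic falls
    within `tol` of the observed one, a set-valued alternative to a single
    point estimate."""
    rng = np.random.default_rng(seed)
    return [theta for theta in candidates
            if abs(simulate(theta, rng) - observed_stat) <= tol]

# Toy usage: calibrate the mean of a Gaussian simulator to an observed mean.
sim = lambda th, r: r.normal(th, 1.0, size=200).mean()
grid = np.linspace(-2.0, 2.0, 81)
eligible = eligibility_set(grid, sim, observed_stat=1.02, tol=0.15)
print(f"{len(eligible)} of {len(grid)} candidate parameters remain eligible")
```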
This list is automatically generated from the titles and abstracts of the papers in this site.