Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes
- URL: http://arxiv.org/abs/2602.00053v1
- Date: Mon, 19 Jan 2026 18:48:29 GMT
- Title: Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes
- Authors: Ratul Ali
- Abstract summary: This paper presents a benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Our results indicate a distinct trade-off between FastAPI and Triton for single-request workloads. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.
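The hybrid architecture described in the abstract, with FastAPI acting as a secure gateway that de-identifies protected health information (PHI) before inference reaches the Triton backend, could be sketched as follows. This is a minimal illustration, not the paper's implementation: the regex patterns and placeholder tokens are assumptions, and a production gateway would use a much richer rule set or an NER model and would forward the sanitized text to Triton (e.g. via `tritonclient`), which is omitted here.

```python
import re

# Illustrative PHI patterns (assumed for this sketch). Real HIPAA-grade
# de-identification covers all 18 Safe Harbor identifier categories.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),         # US SSN-like IDs
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),        # MM/DD/YYYY dates
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"), # email addresses
]

def deidentify(text: str) -> str:
    """Replace PHI-like spans with placeholder tokens before
    the text is forwarded to the inference backend."""
    for pattern, token in PHI_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

In the gateway pattern, a FastAPI route handler would call `deidentify` on the request body and then submit only the sanitized text to the Triton inference endpoint, so raw PHI never leaves the gateway tier.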
Related papers
- AWE: Adaptive Agents for Dynamic Web Penetration Testing [0.0]
AWE is a memory-augmented multi-agent framework for autonomous web penetration testing. It embeds structured, vulnerability-specific analysis pipelines within a lightweight LLM orchestration layer. AWE achieves substantial gains on injection-class vulnerabilities.
arXiv Detail & Related papers (2026-03-01T07:32:42Z) - Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems [51.2882705779387]
Cloud-OpsBench is a large-scale benchmark that employs a State Snapshot Paradigm to construct a deterministic digital twin of the cloud. It features 452 distinct fault cases across 40 root cause types spanning the full stack.
arXiv Detail & Related papers (2026-02-28T05:04:42Z) - AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems [52.65695508605237]
We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
arXiv Detail & Related papers (2026-01-14T11:32:07Z) - FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing [97.35186681023025]
We introduce FFP-300K, a new large-scale dataset of high-fidelity video pairs at 720p resolution and 81 frames in length. We propose a novel framework designed for true guidance-free FFP that resolves the tension between maintaining first-frame appearance and preserving source video motion.
arXiv Detail & Related papers (2026-01-05T01:46:22Z) - Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation [6.240627892585199]
We present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines.
arXiv Detail & Related papers (2025-10-22T15:37:42Z) - Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system [3.823253824850948]
LOGICAL is an efficient, locally deployable PII removal system built on a fine-tuned GLiNER model. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980. LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution.
arXiv Detail & Related papers (2025-10-22T08:12:07Z) - Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z) - VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification [42.14439854721613]
We propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences. Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and an accuracy of up to 99.92%.
arXiv Detail & Related papers (2025-04-14T13:38:00Z) - ZIA: A Theoretical Framework for Zero-Input AI [0.0]
Zero-Input AI (ZIA) introduces a novel framework for human-computer interaction by enabling proactive intent prediction without explicit user commands. It integrates gaze tracking, bio-signals (EEG, heart rate), and contextual data (time, location, usage history) into a multi-modal model for real-time inference. ZIA provides a scalable, privacy-preserving framework for accessibility, healthcare, and consumer applications, advancing AI toward anticipatory intelligence.
arXiv Detail & Related papers (2025-02-22T07:42:05Z) - Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models [62.838689691468666]
We propose Federated Black-Box Prompt Tuning (Fed-BBPT) to optimally harness each local dataset.
Fed-BBPT capitalizes on a central server that aids local users in collaboratively training a prompt generator through regular aggregation.
Relative to extensive fine-tuning, Fed-BBPT proficiently sidesteps memory challenges tied to PTM storage and fine-tuning on local machines.
arXiv Detail & Related papers (2023-10-04T19:30:49Z) - HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units [31.368873375366213]
HOLMES is an online model ensemble serving framework for healthcare applications.
We demonstrate that HOLMES is able to navigate the accuracy/latency tradeoff efficiently, compose the ensemble, and serve the model ensemble pipeline.
HOLMES is tested on a risk prediction task on pediatric cardio ICU data, achieving above 95% prediction accuracy and sub-second latency in a 64-bed simulation.
arXiv Detail & Related papers (2020-08-10T12:38:46Z) - A Privacy-Preserving-Oriented DNN Pruning and Mobile Acceleration Framework [56.57225686288006]
Weight pruning of deep neural networks (DNNs) has been proposed to satisfy the limited storage and computing capability of mobile edge devices.
Previous pruning methods mainly focus on reducing the model size and/or improving performance without considering the privacy of user data.
We propose a privacy-preserving-oriented pruning and mobile acceleration framework that does not require the private training dataset.
arXiv Detail & Related papers (2020-03-13T23:52:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.