Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation
- URL: http://arxiv.org/abs/2510.19689v1
- Date: Wed, 22 Oct 2025 15:37:42 GMT
- Title: Serverless GPU Architecture for Enterprise HR Analytics: A Production-Scale BDaaS Implementation
- Authors: Guilin Zhang, Wulan Guo, Ziqi Tan, Srinivas Vippagunta, Suchitra Raman, Shreeshankar Chatterjee, Ju Lin, Shang Liu, Mary Schladenhauffen, Jeffrey Luo, Hailong Jiang,
- Abstract summary: We present a production-oriented Big Data as a Service (BD) blueprint that integrates a singlenode serverless GPU runtime with TabNet.<n>We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines.<n>Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines.
- Score: 6.240627892585199
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Industrial and government organizations increasingly depend on data-driven analytics for workforce, finance, and regulated decision processes, where timeliness, cost efficiency, and compliance are critical. Distributed frameworks such as Spark and Flink remain effective for massive-scale batch or streaming analytics but introduce coordination complexity and auditing overheads that misalign with moderate-scale, latency-sensitive inference. Meanwhile, cloud providers now offer serverless GPUs, and models such as TabNet enable interpretable tabular ML, motivating new deployment blueprints for regulated environments. In this paper, we present a production-oriented Big Data as a Service (BDaaS) blueprint that integrates a single-node serverless GPU runtime with TabNet. The design leverages GPU acceleration for throughput, serverless elasticity for cost reduction, and feature-mask interpretability for IL4/FIPS compliance. We conduct benchmarks on the HR, Adult, and BLS datasets, comparing our approach against Spark and CPU baselines. Our results show that GPU pipelines achieve up to 4.5x higher throughput, 98x lower latency, and 90% lower cost per 1K inferences compared to Spark baselines, while compliance mechanisms add only ~5.7 ms latency with p99 < 22 ms. Interpretability remains stable under peak load, ensuring reliable auditability. Taken together, these findings provide a compliance-aware benchmark, a reproducible Helm-packaged blueprint, and a decision framework that demonstrate the practicality of secure, interpretable, and cost-efficient serverless GPU analytics for regulated enterprise and government settings.
Related papers
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes.<n>In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID.<n>We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z) - Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes [0.0]
This paper presents a benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server.<n>Our results indicate a distinct trade-off between FastAPI and Triton for single-request workloads.<n>This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.
arXiv Detail & Related papers (2026-01-19T18:48:29Z) - LUT-Compiled Kolmogorov-Arnold Networks for Lightweight DoS Detection on IoT Edge Devices [20.271194684947282]
Kolmogorov-Arnold Networks (KANs) offer a compact alternative to Multi-Layer Perceptrons (MLPs)<n>B-spline evaluation introduces significant computational overhead unsuitable for latency-critical IoT applications.<n>We propose a lookup table (LUT) compilation pipeline that replaces expensive spline computations with precomputed quantized tables.
arXiv Detail & Related papers (2026-01-12T22:32:18Z) - Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware [0.0]
Large Language Models (LLMs) have been accompanied by a reliance on cloud-based, proprietary systems.<n>This paper investigates the feasibility of deploying a high-performance, private LLM inference server at a cost to Small and Medium Businesses.
arXiv Detail & Related papers (2025-12-28T18:08:01Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Metrics and evaluations for computational and sustainable AI efficiency [26.52588349722099]
Current approaches fail to provide a holistic view, making it difficult to compare and optimise systems.<n>We propose a unified and reproducible methodology for AI model inference that integrates computational and environmental metrics.<n>Our framework provides pragmatic, carbon-aware evaluation by systematically measuring latency and distributions throughput, energy consumption, and location-adjusted carbon emissions.
arXiv Detail & Related papers (2025-10-18T03:30:15Z) - RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale [15.199441664697988]
Training software engineering (SWE) LLMs are bottlenecked by expensive infrastructure, inefficient evaluation pipelines, scarce training data, and costly quality control.<n>We present RepoForge, an autonomous, end-to-end pipeline that generates, evaluates, and trains SWE agents at scale.
arXiv Detail & Related papers (2025-08-03T02:34:16Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large computation model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications.<n>In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - LONGER: Scaling Up Long Sequence Modeling in Industrial Recommenders [23.70714095931094]
Long-sequence optimized traNsformer for GPU-Efficient Recommenders.<n>Longer consistently outperforms strong baselines in offline metrics and online A/B testing.
arXiv Detail & Related papers (2025-05-07T13:54:26Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem.<n>Users pay for specific models but have no guarantee that providers deliver them faithfully.<n>We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions.<n>We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - Federated PCA and Estimation for Spiked Covariance Matrices: Optimal Rates and Efficient Algorithm [19.673557166734977]
Federated Learning (FL) has gained significant recent attention in machine learning for its enhanced privacy and data security.
This paper investigates federated PCA and estimation for spiked covariance matrices under distributed differential privacy constraints.
We establish minimax rates of convergence, with a key finding that the central server's optimal rate is the harmonic mean of the local clients' minimax rates.
arXiv Detail & Related papers (2024-11-23T21:57:50Z) - SeBS-Flow: Benchmarking Serverless Cloud Function Workflows [51.4200085836966]
We propose the first serverless workflow benchmarking suite SeBS-Flow.<n>SeBS-Flow includes six real-world application benchmarks and four microbenchmarks representing different computational patterns.<n>We conduct comprehensive evaluations on three major cloud platforms, assessing performance, cost, scalability, and runtime deviations.
arXiv Detail & Related papers (2024-10-04T14:52:18Z) - A Specialized Semismooth Newton Method for Kernel-Based Optimal
Transport [92.96250725599958]
Kernel-based optimal transport (OT) estimators offer an alternative, functional estimation procedure to address OT problems from samples.
We show that our SSN method achieves a global convergence rate of $O (1/sqrtk)$, and a local quadratic convergence rate under standard regularity conditions.
arXiv Detail & Related papers (2023-10-21T18:48:45Z) - Adaptive Federated Pruning in Hierarchical Wireless Networks [69.6417645730093]
Federated Learning (FL) is a privacy-preserving distributed learning framework where a server aggregates models updated by multiple devices without accessing their private datasets.
In this paper, we introduce model pruning for HFL in wireless networks to reduce the neural network scale.
We show that our proposed HFL with model pruning achieves similar learning accuracy compared with the HFL without model pruning and reduces about 50 percent communication cost.
arXiv Detail & Related papers (2023-05-15T22:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.