AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs
- URL: http://arxiv.org/abs/2511.11621v1
- Date: Thu, 06 Nov 2025 14:19:57 GMT
- Title: AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs
- Authors: Pedro Antunes, Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, António Pereira
- Abstract summary: We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform. It uses a software-defined approach for running LLMs across heterogeneous and legacy GPU nodes. It features a unified client interface that allows seamless interaction with all deployed LLMs.
- Score: 0.5863360388454261
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, an assumption that is often unrealistic in academic or otherwise resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform that uses a software-defined approach for running LLMs across heterogeneous and legacy GPU nodes, including NVIDIA and AMD devices, with a focus on fully utilizing each node's VRAM. AIvailable operates as a fully GPU-accelerated inference service without CPU fallbacks, featuring a unified client interface that allows seamless interaction with all deployed LLMs through a single logical unit. The architecture comprises four main components: the Client Interface for user access, the Service Frontend for secure request routing and load balancing, the SDAI Controller for orchestration, deployment, and monitoring, and the Service Backend of heterogeneous GPU nodes executing workloads. By abstracting GPU-specific details and providing dynamic, VRAM-aware allocation and reallocation of models, AIvailable ensures efficient use of resources and resilience against failures and workload fluctuations. Targeting academic labs, private companies, and other constrained organizations, it supports diverse open LLMs, helping to democratize generative AI through the repurposing of legacy GPUs.
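The abstract describes dynamic, VRAM-aware allocation by the SDAI Controller, but the page carries no code. Below is a minimal Python sketch of one plausible placement policy, greedy best-fit by free VRAM; the node names, model sizes, and the policy itself are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GpuNode:
    """A heterogeneous backend node described only by its free VRAM."""
    name: str
    vram_free_gb: float
    models: list = field(default_factory=list)

def place_model(nodes, model_name, vram_needed_gb):
    """Greedy best-fit: pick the node whose free VRAM leaves the least slack.

    Returns the chosen node, or None if no node can host the model
    (a controller would then have to evict or reallocate models).
    """
    candidates = [n for n in nodes if n.vram_free_gb >= vram_needed_gb]
    if not candidates:
        return None
    best = min(candidates, key=lambda n: n.vram_free_gb - vram_needed_gb)
    best.vram_free_gb -= vram_needed_gb
    best.models.append(model_name)
    return best

# Hypothetical mix of legacy NVIDIA and AMD nodes with example quantized models.
cluster = [GpuNode("nvidia-gtx1080ti", 11.0), GpuNode("amd-radeon-vii", 16.0)]
for model, size_gb in [("llama3-8b-q4", 5.6), ("mistral-7b-q4", 4.4)]:
    node = place_model(cluster, model, size_gb)
    print(model, "->", node.name if node else "no capacity")
```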
Related papers
- FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems [39.33711841865621]
FlashInfer-Bench is a framework that connects kernel generation, benchmarking, and deployment.
Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, and a public leaderboard.
Using FlashInfer-Bench, we evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design.
arXiv Detail & Related papers (2026-01-01T06:18:53Z)
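As a hedged illustration of the correctness- and performance-aware benchmarking loop described in the FlashInfer-Bench entry above, here is a self-contained Python sketch. The softmax implementations stand in for generated kernels; none of the names reflect FlashInfer-Bench's actual API.

```python
import math
import time

def reference_softmax(xs):
    """Trusted baseline implementation with the usual max-subtraction trick."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def candidate_softmax(xs):
    """Stand-in for a generated kernel under test (numerically naive)."""
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def benchmark(candidate, reference, inputs, atol=1e-6, repeats=100):
    """Check correctness against the reference first, then time the candidate."""
    for xs in inputs:
        got, want = candidate(xs), reference(xs)
        if any(abs(g - w) > atol for g, w in zip(got, want)):
            return {"correct": False, "seconds": None}
    t0 = time.perf_counter()
    for _ in range(repeats):
        for xs in inputs:
            candidate(xs)
    return {"correct": True, "seconds": (time.perf_counter() - t0) / repeats}

print(benchmark(candidate_softmax, reference_softmax, [[0.1, 1.2, -0.3]]))
```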
- Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing [16.063514680699576]
Multimodal large language models (MLLMs) extend visual understanding through a three-stage pipeline.
Multimodal preprocessing, especially video decoding, often dominates Time-to-First-Token (TTFT).
We present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline.
arXiv Detail & Related papers (2025-12-19T13:40:13Z)
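The entry above attributes TTFT mostly to video decoding and overlaps pipeline stages. The following back-of-envelope Python model, with made-up stage latencies, shows why chunked decoding overlapped with encoding shrinks TTFT; it illustrates the general idea, not the paper's scheduler.

```python
def ttft_sequential(decode_ms, encode_ms, prefill_ms):
    """All three stages run back to back."""
    return decode_ms + encode_ms + prefill_ms

def ttft_pipelined(decode_ms, encode_ms, prefill_ms, chunks):
    """Decode the video in chunks and overlap decoding with encoding.

    A simple pipeline model: once the first chunk is decoded, decoding of
    chunk i+1 overlaps with encoding of chunk i, so the slower of the two
    stages dominates the steady state; prefill still runs after the last
    encoded chunk.
    """
    d, e = decode_ms / chunks, encode_ms / chunks
    return d + (chunks - 1) * max(d, e) + e + prefill_ms

# Hypothetical stage latencies (ms) where video decoding dominates TTFT.
print(ttft_sequential(900, 300, 200))            # 1400
print(ttft_pipelined(900, 300, 200, chunks=8))   # 1137.5
```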
- GPU-Virt-Bench: A Comprehensive Benchmarking Framework for Software-Based GPU Virtualization Systems [0.0]
GPU-Virt-Bench is a comprehensive benchmarking framework that evaluates GPU virtualization systems across 56 performance metrics.
We demonstrate the framework's utility through evaluation of HAMi-core, BUD-FCSP, and simulated MIG baselines.
arXiv Detail & Related papers (2025-11-26T09:42:05Z)
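To make the GPU-Virt-Bench entry above concrete, here is a toy Python sketch of one of the many metrics such a framework could report: relative slowdown under virtualization. The workload, the timing harness, and the placeholder virtualized measurement are assumptions for illustration only.

```python
import time

def run_workload(n=200_000):
    """Stand-in compute loop; a real harness would launch GPU work here."""
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

def measure_ms(fn, repeats=5):
    """Median wall-clock latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[len(samples) // 2]

native_ms = measure_ms(run_workload)
# Placeholder: in a real run this would be measured under the virtualizer.
virtualized_ms = native_ms * 1.12
print(f"virtualization overhead: {100 * (virtualized_ms / native_ms - 1):.1f}%")
```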
- LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure [4.382902234869111]
This paper introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems.
It addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques.
arXiv Detail & Related papers (2025-11-10T15:47:53Z)
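The LLMServingSim2.0 entry above points to the lack of a clear hardware abstraction in system-level simulators. A minimal Python sketch of such a boundary follows; the interface and the linear latency model are assumptions, not the simulator's actual design.

```python
from abc import ABC, abstractmethod

class HardwareModel(ABC):
    """Abstraction boundary: the system-level simulator only ever asks a
    hardware model for per-batch step latency, nothing device-specific."""

    @abstractmethod
    def step_latency_ms(self, batch_tokens: int) -> float: ...

class SimpleGpuModel(HardwareModel):
    """Hypothetical linear model: fixed overhead plus throughput-bound time."""
    def __init__(self, tokens_per_ms: float, overhead_ms: float):
        self.tokens_per_ms = tokens_per_ms
        self.overhead_ms = overhead_ms

    def step_latency_ms(self, batch_tokens: int) -> float:
        return self.overhead_ms + batch_tokens / self.tokens_per_ms

def simulate(hardware: HardwareModel, batches):
    """System-level loop: accumulates latency without knowing the hardware."""
    clock_ms = 0.0
    for batch_tokens in batches:
        clock_ms += hardware.step_latency_ms(batch_tokens)
    return clock_ms

print(simulate(SimpleGpuModel(tokens_per_ms=40.0, overhead_ms=0.3), [512, 256, 128]))
```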
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.
We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.
We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
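The "Three Taxes" entry above frames distributed-step latency analytically. The Python sketch below encodes that decomposition with hypothetical numbers chosen to land in the reported 10-20% range; the constants are illustrative, not measurements from the paper.

```python
def step_time_ms(compute_ms, sync_ms, locality_ms, launches, launch_us=5.0):
    """Decompose one distributed step into useful compute plus three 'taxes':
    bulk-synchronous waiting, lost inter-kernel data locality, and
    per-kernel launch overhead."""
    launch_ms = launches * launch_us / 1000.0
    return compute_ms + sync_ms + locality_ms + launch_ms

# Hypothetical step: fusing kernels (fewer launches) and relaxing BSP-style
# barriers recovers part of the end-to-end latency.
before = step_time_ms(compute_ms=8.0, sync_ms=0.8, locality_ms=0.4, launches=200)
after = step_time_ms(compute_ms=8.0, sync_ms=0.3, locality_ms=0.2, launches=40)
print(f"speedup: {100 * (1 - after / before):.1f}%")  # ~14.7%, in the 10-20% ballpark
```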
- Scalable GPU-Based Integrity Verification for Large Machine Learning Models [4.301162531343759]
We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms.
Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators.
We provide a hardware-agnostic foundation that enterprise teams can deploy regardless of their underlying CPU and GPU infrastructures.
arXiv Detail & Related papers (2025-10-27T23:45:21Z)
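The integrity-verification entry above co-locates hashing with GPU execution. The Python sketch below shows only the underlying hash-manifest idea, on the CPU with hashlib; the shard names and contents are made up, and the paper's GPU-side mechanism is not reproduced here.

```python
import hashlib

def shard_digests(shards):
    """Hash every model shard. The paper performs this on the GPU, co-located
    with model execution; this sketch uses the CPU for illustration."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in shards.items()}

def verify(shards, manifest):
    """Return the names of shards whose digests no longer match the manifest."""
    current = shard_digests(shards)
    return [name for name in manifest if current.get(name) != manifest[name]]

# Hypothetical two-shard model with a trusted manifest recorded at release time.
weights = {"layer0.bin": b"\x00" * 1024, "layer1.bin": b"\x01" * 1024}
manifest = shard_digests(weights)
weights["layer1.bin"] = b"\x02" * 1024  # simulated tampering in transit
print("tampered shards:", verify(weights, manifest))
```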
- xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.
xLLM builds a novel decoupled service-engine architecture.
xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
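The xLLM entry above highlights a decoupled service-engine architecture. The sketch below shows the generic decoupling pattern, a queue between a service layer and an engine worker; it is a pattern illustration, not xLLM's design or API.

```python
import queue
import threading

requests: queue.Queue = queue.Queue()

def engine_worker():
    """Engine side: pulls requests and executes them, independent of how the
    service layer accepts traffic. A None prompt is the shutdown sentinel."""
    while True:
        prompt, reply_to = requests.get()
        if prompt is None:
            break
        reply_to.put(f"generated({prompt})")  # stand-in for real inference

def serve(prompt: str) -> str:
    """Service side: enqueue and wait; knows nothing about the engine."""
    reply_to: queue.Queue = queue.Queue()
    requests.put((prompt, reply_to))
    return reply_to.get()

worker = threading.Thread(target=engine_worker, daemon=True)
worker.start()
print(serve("hello"))
requests.put((None, None))  # shut the worker down
```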
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM by 5.2-6.5x and in its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
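The DeeR entry above adjusts how much of the MLLM is activated per input. A minimal Python sketch of the underlying early-exit idea follows; the layers, confidence function, and threshold are toy stand-ins, and DeeR's actual exit criterion is more sophisticated.

```python
def forward_with_early_exit(x, layers, confidence, threshold=0.9):
    """Run layers until an intermediate prediction is confident enough.

    `layers` transform the activation; `confidence` scores it. Exiting early
    means later layers never run, which is where the compute saving comes from.
    """
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        if confidence(x) >= threshold:
            return x, depth
    return x, len(layers)

# Toy stand-ins: each "layer" nudges the activation; confidence is its value.
layers = [lambda v: v + 0.25] * 6
result, used = forward_with_early_exit(0.0, layers, confidence=lambda v: v)
print(f"exited after {used}/{len(layers)} layers with activation {result}")
```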
- Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models [8.02264001053969]
Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts.
With constant innovation in LLM serving optimizations and model architecture evolving at breakneck speed, the hardware requirements to meet Service Level Objectives (SLOs) remain an open research question.
We present an analytical tool, GenZ, to efficiently navigate the relationship between diverse LLM model architectures and AI platform design parameters.
arXiv Detail & Related papers (2024-06-03T18:00:50Z)
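The GenZ entry above navigates model architecture versus platform parameters analytically. A roofline-style single-stream decode estimate in the same spirit is sketched below; the formula and the hypothetical accelerator numbers are common approximations, not GenZ's model.

```python
def decode_tokens_per_s(params_b, bytes_per_param, hbm_gb_s, peak_tflops):
    """Roofline-style estimate for single-stream decode.

    Decoding one token must read all weights once (memory bound) and spend
    roughly 2 * params FLOPs (compute bound); the slower bound wins.
    """
    weight_bytes = params_b * 1e9 * bytes_per_param
    flops = 2 * params_b * 1e9
    t_mem = weight_bytes / (hbm_gb_s * 1e9)
    t_compute = flops / (peak_tflops * 1e12)
    return 1.0 / max(t_mem, t_compute)

# Hypothetical 70B-parameter model in FP16 on a 3350 GB/s, 989 TFLOPS accelerator:
# heavily memory-bound, so bandwidth sets the ceiling.
print(f"{decode_tokens_per_s(70, 2, 3350, 989):.1f} tokens/s")
```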
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data center GPU, and study the network utilization in realistic conditions.
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
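The federated fine-tuning entry above compares model FLOP utilization (MFU) against a data-center GPU. The sketch below computes MFU with the common ~6 x parameters FLOPs-per-token training approximation; all throughput and peak numbers are invented for illustration.

```python
def model_flop_utilization(tokens_per_s, params_b, peak_tflops):
    """MFU: achieved FLOPs as a fraction of the hardware's peak.

    Uses the common ~6 * params FLOPs-per-token approximation for training.
    """
    achieved_flops = tokens_per_s * 6 * params_b * 1e9
    return achieved_flops / (peak_tflops * 1e12)

# Hypothetical numbers: the same 1B-parameter fine-tune on an edge device
# versus a data-center GPU, in the spirit of the paper's micro-level benchmark.
edge_mfu = model_flop_utilization(tokens_per_s=15, params_b=1, peak_tflops=10)
dc_mfu = model_flop_utilization(tokens_per_s=9000, params_b=1, peak_tflops=312)
print(f"edge MFU: {edge_mfu:.1%}, data-center MFU: {dc_mfu:.1%}")
```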
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability introduced by peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures [68.91874045918112]
adapter-ALBERT is an efficient model optimization that achieves maximal data reuse across different tasks.
We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv Detail & Related papers (2023-03-25T14:40:59Z)
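The adapter-ALBERT entry above relies on small task-specific adapters attached to a shared frozen backbone. The dependency-free Python sketch below shows a generic bottleneck adapter, which captures the data-reuse mechanism in spirit; the dimensions and weights are toy values, not the paper's configuration.

```python
import math
import random

def linear(x, w):
    """Dense layer: w is a list of rows (output dim x input dim)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    Only these small matrices are task-specific; the large frozen backbone
    activations (x) are reused across tasks, which is the data-reuse property
    the summary refers to."""
    h = [math.tanh(v) for v in linear(x, w_down)]
    out = linear(h, w_up)
    return [xi + oi for xi, oi in zip(x, out)]

random.seed(0)
hidden, bottleneck = 8, 2  # tiny dimensions for illustration
w_down = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(bottleneck)]
w_up = [[random.gauss(0, 0.1) for _ in range(bottleneck)] for _ in range(hidden)]
print(adapter([1.0] * hidden, w_down, w_up))
```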