Related papers: Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry

URL: http://arxiv.org/abs/2510.26008v2
Date: Fri, 31 Oct 2025 01:21:23 GMT
Title: Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
Authors: Ziji Chen, Steven W. D. Chien, Peng Qian, Noa Zilberman
Abstract summary: Workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals. We successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.
Score: 6.238074548326156
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platform-as-a-service offerings use virtualization, which means operators have little insight into users' workloads. This hinders resource optimization by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals, which are fully accessible to operators. Using low-level signals collected from the system, Reveal detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using Reveal, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.

Related papers

iOS as Acceleration [51.56484100374058]
We present a proof-of-concept system demonstrating a novel approach to harnessing an iOS device via distributed pipeline parallelism. The findings of this paper highlight the potential of steadily improving commonplace mobile devices to contribute more to machine learning.
arXiv Detail & Related papers (2025-12-19T13:30:44Z)
RockNet: Distributed Learning on Ultra-Low-Power Devices [49.01692357536576]
This paper presents RockNet, a new TinyML method tailored for ultra-low-power hardware. By leveraging the fact that CPS consist of multiple devices, we design a distributed learning method that integrates machine learning and wireless communication. Our results show that a tight integration of distributed ML, distributed computing, and communication enables, for the first time, training on ultra-low-power hardware with state-of-the-art accuracy.
arXiv Detail & Related papers (2025-10-15T09:09:30Z)
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective [6.51239603014107]
Large Language Models (LLMs) have pushed training workloads beyond the limits of single-node analysis. We present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms.
arXiv Detail & Related papers (2025-09-12T16:05:07Z)
BanditWare: A Contextual Bandit-based Framework for Hardware Prediction [0.0]
BanditWare is an online recommendation system that dynamically selects the most suitable hardware for applications. Unlike traditional statistical and machine learning approaches, BanditWare operates online, learning and adapting in real time as new workloads arrive.
arXiv Detail & Related papers (2025-06-16T17:40:34Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
TDML -- A Trustworthy Distributed Machine Learning Framework [7.302091381583343]
The rapid advancement of large models (LM) has intensified the demand for computing resources. This demand is exacerbated by limited availability due to supply chain delays and monopolistic acquisition by major tech firms. We propose a trustworthy distributed machine learning (TDML) framework that leverages guidance to coordinate remote trainers and validate workloads.
arXiv Detail & Related papers (2024-07-10T03:22:28Z)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size. These models require high-end hardware, making them inaccessible to most researchers. We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines [5.3477186309338505]
Deep neural networks are increasingly being used to embed intelligence in smart devices. Their storage and processing requirements make them prohibitive for cheap, off-the-shelf platforms. We holistically explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors.
arXiv Detail & Related papers (2023-10-11T23:22:30Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems [61.335229621081346]
Federated Learning (FL) has become a viable technique for realizing privacy-enhancing distributed deep learning on the network edge. In this paper, we propose FLEdge, which complements existing FL benchmarks by enabling a systematic evaluation of client capabilities.
arXiv Detail & Related papers (2023-06-08T13:11:20Z)
FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task. The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total. It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared with its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines. This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
Clairvoyant Prefetching for Distributed Machine Learning I/O [9.490118207943192]
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers. We produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, scalable solution that delivers better performance than state-of-the-art approaches.
arXiv Detail & Related papers (2021-01-21T17:21:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.