Alioth: A Machine Learning Based Interference-Aware Performance Monitor
for Multi-Tenancy Applications in Public Cloud
- URL: http://arxiv.org/abs/2307.08949v1
- Date: Tue, 18 Jul 2023 03:34:33 GMT
- Title: Alioth: A Machine Learning Based Interference-Aware Performance Monitor
for Multi-Tenancy Applications in Public Cloud
- Authors: Tianyao Shi, Yingxuan Yang, Yunlong Cheng, Xiaofeng Gao, Zhen Fang,
Yongqiang Yang
- Abstract summary: Multi-tenancy in public clouds may lead to co-location interference on shared resources, which possibly results in performance degradation.
We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications.
Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when testing on applications unseen in the training stage.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-tenancy in public clouds may lead to co-location interference on shared
resources, which possibly results in performance degradation of cloud
applications. Cloud providers want to know when such events happen and how
serious the degradation is, to perform interference-aware migrations and
alleviate the problem. However, virtual machines (VMs) in
Infrastructure-as-a-Service public clouds are black boxes to providers, so
application-level performance information cannot be acquired. This makes
performance monitoring intensely challenging as cloud providers can only rely
on low-level metrics such as CPU usage and hardware counters.
We propose a novel machine learning framework, Alioth, to monitor the
performance degradation of cloud applications. To feed the data-hungry models,
we first elaborate interference generators and conduct comprehensive
co-location experiments on a testbed to build Alioth-dataset which reflects the
complexity and dynamicity in real-world scenarios. Then we construct Alioth by
(1) augmenting features via recovering low-level metrics under no interference
using denoising auto-encoders, (2) devising a transfer learning model based on
domain adaptation neural network to make models generalize on test cases unseen
in offline training, and (3) developing a SHAP explainer to automate feature
selection and enhance model interpretability. Experiments show that Alioth
achieves an average mean absolute error of 5.29% offline and 10.8% when testing
on applications unseen in the training stage, outperforming the baseline
methods. Alioth is also robust in signaling quality-of-service violation under
dynamicity. Finally, we demonstrate a possible application of Alioth's
interpretability, providing insights to benefit the decision-making of cloud
operators. The dataset and code of Alioth have been released on GitHub.
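The monitoring loop the abstract describes can be sketched end-to-end: augment low-level metrics with their recovered no-interference values, regress the augmented features to a degradation estimate, and flag a quality-of-service violation above a threshold. The sketch below is illustrative only; `denoiser` and `regressor` are toy stand-ins for the paper's denoising auto-encoder and domain-adaptation network, and all names and numbers are hypothetical.

```python
# Illustrative sketch of the Alioth-style monitoring pipeline.
# The denoiser and regressor are toy stand-ins, not the authors' models.

def augment_features(observed, denoiser):
    """(1) Feature augmentation: pair each low-level metric observed under
    interference with its recovered no-interference value."""
    recovered = [denoiser(x) for x in observed]
    return observed + recovered

def predict_degradation(features, regressor):
    """(2) Regression: map augmented features to estimated degradation,
    expressed as a fraction of no-interference performance."""
    return regressor(features)

def signal_qos_violation(features, regressor, threshold=0.05):
    """(3) Flag a quality-of-service violation when estimated degradation
    exceeds an operator-chosen threshold."""
    return predict_degradation(features, regressor) > threshold

# Toy stand-ins for illustration only.
observed = [0.92, 0.65]                 # e.g. CPU usage, cache-hit rate
denoiser = lambda x: min(1.0, x * 1.1)  # pretend-recovered baseline metric
regressor = lambda f: sum(f[len(f)//2:]) / sum(f[:len(f)//2]) - 1.0

features = augment_features(observed, denoiser)
estimate = predict_degradation(features, regressor)
```

With these toy inputs the estimated degradation is roughly 9%, so the 5% threshold trips; in the real system the regressor would be trained on the Alioth-dataset rather than hand-written.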
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing shifts data analysis to the edge, but existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Effective Intrusion Detection in Heterogeneous Internet-of-Things Networks via Ensemble Knowledge Distillation-based Federated Learning [52.6706505729803]
We introduce Federated Learning (FL) to collaboratively train a decentralized shared model of Intrusion Detection Systems (IDS).
FLEKD enables a more flexible aggregation method than conventional model fusion techniques.
Experiment results show that the proposed approach outperforms local training and traditional FL in terms of both speed and performance.
arXiv Detail & Related papers (2024-01-22T14:16:37Z)
- Benchmarking Function Hook Latency in Cloud-Native Environments [0.5188841610098435]
Cloud-native applications are often instrumented or altered at runtime by dynamically patching or hooking them, which introduces significant performance overhead.
We present recommendations to mitigate these risks and demonstrate how an improper experimental setup can negatively impact latency measurements.
arXiv Detail & Related papers (2023-10-19T12:54:32Z)
- Nebula: Self-Attention for Dynamic Malware Analysis [14.710331873072146]
We introduce Nebula, a versatile, self-attention Transformer-based neural architecture that generalizes across different behavioral representations and formats.
We perform experiments on both malware detection and classification tasks, using three datasets acquired from different dynamic analysis platforms.
We show that self-supervised pre-training matches the performance of fully supervised models with only 20% of the training data.
arXiv Detail & Related papers (2023-09-19T09:24:36Z)
- Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-world [20.547119604004774]
We propose a new learning paradigm of Cloud-Device Collaborative Continual Adaptation, which encourages collaboration between cloud and device.
We also propose an Uncertainty-based Visual Prompt Adapted (U-VPA) teacher-student model to transfer the generalization capability of the large model on the cloud to the device model.
Our proposed U-VPA teacher-student framework outperforms previous state-of-the-art test time adaptation and device-cloud collaboration methods.
arXiv Detail & Related papers (2022-12-02T05:02:36Z)
- MetaNetwork: A Task-agnostic Network Parameters Generation Framework for Improving Device Model Generalization [65.02542875281233]
We propose a novel task-agnostic framework, named MetaNetwork, for generating adaptive device model parameters from cloud without on-device training.
The MetaGenerator is designed to learn a mapping function from samples to model parameters, and it can generate and deliver the adaptive parameters to the device based on samples uploaded from the device to the cloud.
The MetaStabilizer aims to reduce the oscillation of the MetaGenerator, accelerate the convergence and improve the model performance during both training and inference.
arXiv Detail & Related papers (2022-09-12T13:26:26Z)
- ESAI: Efficient Split Artificial Intelligence via Early Exiting Using Neural Architecture Search [6.316693022958222]
Deep neural networks have been outperforming conventional machine learning algorithms in many computer vision-related tasks.
Most devices rely on cloud computing, in which powerful deep learning models analyze the data on the server.
In this paper, a new framework for deployment on IoT devices is proposed that can take advantage of both the cloud and on-device models.
arXiv Detail & Related papers (2021-06-21T04:47:53Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.