Model-based Reinforcement Learning for Service Mesh Fault Resiliency in
a Web Application-level
- URL: http://arxiv.org/abs/2110.13621v1
- Date: Thu, 21 Oct 2021 23:30:40 GMT
- Title: Model-based Reinforcement Learning for Service Mesh Fault Resiliency in
a Web Application-level
- Authors: Fanfei Meng, Lalita Jagadeesan, Marina Thottan
- Abstract summary: We present a model-based reinforcement learning workflow towards service mesh fault resiliency.
Our approach enables the prediction of the most significant fault resilience behaviors at a web application-level.
- Score: 0.7519872646378836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Microservice-based architectures enable different aspects of web applications
to be created and updated independently, even after deployment. Associated
technologies such as service mesh provide application-level fault resilience
through attribute configurations that govern the behavior of request-response
services -- and the interactions among them -- in the presence of failures.
While this provides tremendous flexibility, the configured values of these
attributes -- and the relationships among them -- can significantly affect the
performance and fault resilience of the overall application. Furthermore, it is
impossible to determine the best and worst combinations of attribute values
with respect to fault resiliency via testing, due to the complexities of the
underlying distributed system and the many possible attribute value
combinations. In this paper, we present a model-based reinforcement learning
workflow towards service mesh fault resiliency. Our approach enables the
prediction of the most significant fault resilience behaviors at the web
application level, starting from a single service and extending to aggregated
multi-service management with efficient agent collaboration.
Related papers
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes.
In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID.
We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z) - An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models [59.13182819190547]
Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields.
They face challenges such as complex design specifications and scalability issues with large datasets.
This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability.
arXiv Detail & Related papers (2025-11-11T10:28:23Z) - FLAS: a combination of proactive and reactive auto-scaling architecture for distributed services [0.0]
We present FLAS (Forecasted Load Auto-Scaling), an auto-scaler for distributed services.
It combines the advantages of proactive and reactive approaches according to the situation to decide the optimal scaling actions.
We provide a FLAS implementation for the use case of a content-based publish-subscribe distributed system.
arXiv Detail & Related papers (2025-10-23T09:38:07Z) - Learning Unified System Representations for Microservice Tail Latency Prediction [8.532290784939967]
Microservice architectures have become the de facto standard for building scalable cloud-native applications.
Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise.
We propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features.
arXiv Detail & Related papers (2025-08-03T07:46:23Z) - MEL: Multi-level Ensemble Learning for Resource-Constrained Environments [1.59297928921015]
We propose a new framework for resilient edge inference, Multi-Level Ensemble Learning (MEL).
MEL trains multiple lightweight backup models capable of operating collaboratively, refining each other when multiple servers are available, and independently under failures.
Empirical evaluations across vision, language, and audio datasets show that MEL provides performance comparable to original architectures.
arXiv Detail & Related papers (2025-06-25T02:33:57Z) - Federated In-Context Learning: Iterative Refinement for Improved Answer Quality [62.72381208029899]
In-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input.
We propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process.
Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters.
arXiv Detail & Related papers (2025-06-09T05:33:28Z) - InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems [76.39776789410088]
This work introduces a framework that combines the strong performance of supervised approaches and the flexibility of zero-shot methods.
A novel architectural design seamlessly integrates the degradation operator directly into the denoiser.
Experimental results on the FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling performance.
arXiv Detail & Related papers (2025-04-02T12:40:57Z) - Robust Asymmetric Heterogeneous Federated Learning with Corrupted Clients [60.22876915395139]
This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients.
Data corruption is unavoidable due to factors such as random noise, compression artifacts, or environmental conditions in real-world deployment.
We propose a novel Robust Asymmetric Heterogeneous Federated Learning framework to address these issues.
arXiv Detail & Related papers (2025-03-12T09:52:04Z) - LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience [5.644170923282226]
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data.
We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year.
We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
arXiv Detail & Related papers (2025-01-28T06:41:37Z) - SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering [0.0]
Real-world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements.
We present a systems-oriented approach to multi-agent RAG tailored for real-world Question Answering (QA) applications.
arXiv Detail & Related papers (2024-12-07T01:32:13Z) - Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank [52.831993899183416]
We introduce a structural assumption -- the interaction rank -- and establish that functions with low interaction rank are significantly more robust to distribution shift compared to general ones.
We demonstrate that utilizing function classes with low interaction rank, when combined with regularization and no-regret learning, admits decentralized, computationally and statistically efficient learning in offline MARL.
arXiv Detail & Related papers (2024-10-01T22:16:22Z) - Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources [1.1470070927586018]
We develop a model that captures the relationship between an end-to-end latency, requests at the front-end level, and resource utilization.
We then use the developed model to predict the end-to-end latency.
We demonstrate the merit of a microservice-based application and provide a roadmap to deployment.
arXiv Detail & Related papers (2024-09-04T22:03:07Z) - Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z) - DeepScaler: Holistic Autoscaling for Microservices Based on
Spatiotemporal GNN with Adaptive Graph Learning [4.128665560397244]
This paper presents DeepScaler, a deep learning-based holistic autoscaling approach.
It focuses on coping with service dependencies to optimize service-level agreements (SLA) assurance and cost efficiency.
Experimental results demonstrate that our method implements a more effective autoscaling mechanism for microservices.
arXiv Detail & Related papers (2023-09-02T08:22:21Z) - Learning Prompt-Enhanced Context Features for Weakly-Supervised Video
Anomaly Detection [37.99031842449251]
Video anomaly detection under weak supervision presents significant challenges.
We present a weakly supervised anomaly detection framework that focuses on efficient context modeling and enhanced semantic discriminability.
Our approach significantly improves the detection accuracy of certain anomaly sub-classes, underscoring its practical value and efficacy.
arXiv Detail & Related papers (2023-06-26T06:45:16Z) - Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank.
Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z) - Federated Learning with Unreliable Clients: Performance Analysis and
Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z) - MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and
Architectures [61.73533544385352]
We propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data.
As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures.
arXiv Detail & Related papers (2020-06-13T02:54:59Z) - AI-based Resource Allocation: Reinforcement Learning for Adaptive
Auto-scaling in Serverless Environments [0.0]
Serverless computing has emerged as a compelling new paradigm of cloud computing models in recent years.
A common approach among both commercial and open source serverless computing platforms is workload-based auto-scaling.
In this paper we investigate the applicability of a reinforcement learning approach to request-based auto-scaling in a serverless framework.
arXiv Detail & Related papers (2020-05-29T06:18:39Z) - Dataless Model Selection with the Deep Frame Potential [45.16941644841897]
We quantify networks by their intrinsic capacity for unique and robust representations.
We propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure.
We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.
arXiv Detail & Related papers (2020-03-30T23:27:25Z) - Dynamic Federated Learning [57.14673504239551]
Federated learning has emerged as an umbrella term for centralized coordination strategies in multi-agent environments.
We consider a federated learning model where at every iteration, a random subset of available agents perform local updates based on their data.
Under a non-stationary random walk model on the true minimizer for the aggregate optimization problem, we establish that the performance of the architecture is determined by three factors, namely, the data variability at each agent, the model variability across all agents, and a tracking term that is inversely proportional to the learning rate of the algorithm.
arXiv Detail & Related papers (2020-02-20T15:00:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.