Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework
- URL: http://arxiv.org/abs/2505.21559v1
- Date: Mon, 26 May 2025 20:39:31 GMT
- Title: Streamlining Resilient Kubernetes Autoscaling with Multi-Agent Systems via an Automated Online Design Framework
- Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron,
- Abstract summary: Cloud-native systems often face challenges to their operational resilience due to poor workload management issues.<n>We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents.<n>We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In cloud-native systems, Kubernetes clusters with interdependent services often face challenges to their operational resilience due to poor workload management issues such as resource blocking, bottlenecks, or continuous pod crashes. These vulnerabilities are further amplified in adversarial scenarios, such as Distributed Denial-of-Service attacks (DDoS). Conventional Horizontal Pod Autoscaling (HPA) approaches struggle to address such dynamic conditions, while reinforcement learning-based methods, though more adaptable, typically optimize single goals like latency or resource usage, neglecting broader failure scenarios. We propose decomposing the overarching goal of maintaining operational resilience into failure-specific sub-goals delegated to collaborative agents, collectively forming an HPA Multi-Agent System (MAS). We introduce an automated, four-phase online framework for HPA MAS design: 1) modeling a digital twin built from cluster traces; 2) training agents in simulation using roles and missions tailored to failure contexts; 3) analyzing agent behaviors for explainability; and 4) transferring learned policies to the real cluster. Experimental results demonstrate that the generated HPA MASs outperform three state-of-the-art HPA systems in sustaining operational resilience under various adversarial conditions in a proposed complex cluster.
Related papers
- K^2-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control [73.50217471850658]
K2-Agent is a hierarchical framework that models human-like cognition by knowing and co-evolving declarative (what) and procedural (how) knowledge for planning and execution.<n>On the challenging AndroidWorld benchmark, K2-Agent achieves a 76.1% success rate using only raw and open-source backbones.
arXiv Detail & Related papers (2026-02-28T14:33:14Z) - MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present textbfMagicAgent, a series of foundation models specifically designed for generalized agent planning.<n>We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks.<n>We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z) - AgentCgroup: Understanding and Controlling OS Resources of AI Agents [2.8139711959925244]
AI agents are increasingly deployed in multi-tenant cloud environments, where they execute diverse tool calls within sandboxed containers.<n>We present a systematic characterization of OS-level resource dynamics in sandboxed AI coding agents.<n>Preliminary evaluation demonstrates improved multi-tenant isolation and reduced resource waste.
arXiv Detail & Related papers (2026-02-10T02:37:42Z) - ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks [62.031889234230725]
6G networks rely on complex cross-layer optimization.<n> manually translating high-level intents into mathematical formulations remains a bottleneck.<n>We present ComAgent, a multi-LLM agentic AI framework.
arXiv Detail & Related papers (2026-01-27T13:43:59Z) - Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z) - Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts [75.20929587906228]
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks.<n>However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts.
arXiv Detail & Related papers (2025-09-27T08:43:34Z) - From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario [3.5262044630932254]
Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints.<n>Despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited.
arXiv Detail & Related papers (2025-08-06T17:54:10Z) - CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization [73.13843039509386]
This paper presents CyGATE, a game-theoretic framework modeling attacker-defender interactions.<n>CyGATE frames cyber conflicts as a partially observable game (POSG) across Cyber Kill Chain stages.<n>The framework's flexible architecture enables extension to multi-agent scenarios.
arXiv Detail & Related papers (2025-08-01T09:53:06Z) - Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security [0.0]
This paper introduces a novel Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning framework.<n>The framework incorporates an adversarial training loop designed to simulate and anticipate evolving cyber threats.
arXiv Detail & Related papers (2025-06-12T01:38:25Z) - Compositional Learning for Modular Multi-Agent Self-Organizing Networks [0.7122137885660501]
Self-organizing networks face challenges from complex parameter interdependencies and conflicting objectives.<n>This study introduces two compositional learning approaches-Compositional Deep Reinforcement Learning (CDRL) and Compositional Predictive Decision-Making (CPDM)<n>We propose a modular, two-tier framework with cell-level and cell-pair-level agents to manage heterogeneous agent granularities while reducing model complexity.
arXiv Detail & Related papers (2025-06-03T08:33:18Z) - Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
offline multi-agent reinforcement learning (MARL) faces critical challenges due to distributional shifts and the high dimensionality of joint action spaces.<n>We propose a novel two-stage framework for modeling diverse multi-agent coordination patterns.<n>Our approach provides new insights into offline coordination and equilibrium selection in cooperative multi-agent systems.
arXiv Detail & Related papers (2025-05-09T11:42:31Z) - Graph Based Deep Reinforcement Learning Aided by Transformers for Multi-Agent Cooperation [2.8169258551959544]
We propose a novel framework that integrates Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), and transformer-based mechanisms for enhanced multi-agent coordination and collective task execution.<n>Our approach leverages GNNs to model agent-agent and agent-goal interactions through adaptive graph construction, enabling efficient information aggregation and decision-making under constrained communication.
arXiv Detail & Related papers (2025-04-11T01:46:18Z) - AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems [22.291969093748005]
AgentNet is a decentralized, Retrieval-Augmented Generation (RAG)-based framework for multi-agent systems.<n>Unlike traditional multi-agent systems that depend on static assignments or centralized control, AgentNet allows agents to specialize dynamically.<n>AgentNet promotes scalable adaptability and enables privacy-preserving collaboration across organizations.
arXiv Detail & Related papers (2025-04-01T09:45:25Z) - Intelligent Mobile AI-Generated Content Services via Interactive Prompt Engineering and Dynamic Service Provisioning [55.641299901038316]
AI-generated content can organize collaborative Mobile AIGC Service Providers (MASPs) at network edges to provide ubiquitous and customized content for resource-constrained users.<n>Such a paradigm faces two significant challenges: 1) raw prompts often lead to poor generation quality due to users' lack of experience with specific AIGC models, and 2) static service provisioning fails to efficiently utilize computational and communication resources.<n>We develop an interactive prompt engineering mechanism that leverages a Large Language Model (LLM) to generate customized prompt corpora and employs Inverse Reinforcement Learning (IRL) for policy imitation.
arXiv Detail & Related papers (2025-02-17T03:05:20Z) - Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence [79.5316642687565]
Existing multi-agent frameworks often struggle with integrating diverse capable third-party agents.
We propose the Internet of Agents (IoA), a novel framework that addresses these limitations.
IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control.
arXiv Detail & Related papers (2024-07-09T17:33:24Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - MADiff: Offline Multi-agent Learning with Diffusion Models [79.18130544233794]
MADiff is a diffusion-based multi-agent learning framework.<n>It works as both a decentralized policy and a centralized controller.<n>Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks.
arXiv Detail & Related papers (2023-05-27T02:14:09Z) - Multi-Agent Reinforcement Learning for Long-Term Network Resource
Allocation through Auction: a V2X Application [7.326507804995567]
We formulate offloading of computational tasks from a dynamic group of mobile agents (e.g., cars) as decentralized decision making among autonomous agents.
We design an interaction mechanism that incentivizes such agents to align private and system goals by balancing between competition and cooperation.
We propose a novel multi-agent online learning algorithm that learns with partial, delayed and noisy state information.
arXiv Detail & Related papers (2022-07-29T10:29:06Z) - Adaptative Perturbation Patterns: Realistic Adversarial Learning for
Robust NIDS [0.3867363075280543]
Adrial attacks pose a major threat to machine learning and to the systems that rely on it.
This work introduces the Adaptative Perturbation Pattern Method (A2PM) to fulfill these constraints in a gray-box setting.
A2PM relies on pattern sequences that are independently adapted to the characteristics of each class to create valid and coherent data perturbations.
arXiv Detail & Related papers (2022-03-08T17:52:09Z) - Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z) - Automated Adversary Emulation for Cyber-Physical Systems via
Reinforcement Learning [4.763175424744536]
We develop an automated, domain-aware approach to adversary emulation for cyber-physical systems.
We formulate a Markov Decision Process (MDP) model to determine an optimal attack sequence over a hybrid attack graph.
We apply model-based and model-free reinforcement learning (RL) methods to solve the discrete-continuous MDP in a tractable fashion.
arXiv Detail & Related papers (2020-11-09T18:44:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.