Tracing and Metrics Design Patterns for Monitoring Cloud-native Applications
- URL: http://arxiv.org/abs/2510.02991v1
- Date: Fri, 03 Oct 2025 13:32:15 GMT
- Title: Tracing and Metrics Design Patterns for Monitoring Cloud-native Applications
- Authors: Carlos Albuquerque, Filipe F. Correia,
- Abstract summary: This paper builds upon our previous work and introduces three design patterns that address key challenges in monitoring cloud-native applications.<n> Distributed Tracing improves visibility into request flows across services, aiding in latency analysis and root cause detection, Application Metrics provides a structured approach to instrumenting applications with meaningful performance indicators, and Infrastructure Metrics focuses on monitoring the environment in which the system is operated.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Observability helps ensure the reliability and maintainability of cloud-native applications. As software architectures become increasingly distributed and subject to change, it becomes a greater challenge to diagnose system issues effectively, often having to deal with fragmented observability and more difficult root cause analysis. This paper builds upon our previous work and introduces three design patterns that address key challenges in monitoring cloud-native applications. Distributed Tracing improves visibility into request flows across services, aiding in latency analysis and root cause detection, Application Metrics provides a structured approach to instrumenting applications with meaningful performance indicators, enabling real-time monitoring and anomaly detection, and Infrastructure Metrics focuses on monitoring the environment in which the system is operated, helping teams assess resource utilization, scalability, and operational health. These patterns are derived from industry practices and observability frameworks and aim to offer guidance for software practitioners.
Related papers
- Verification of Visual Controllers via Compositional Geometric Transformations [49.81690518952909]
We introduce a novel verification framework for perception-based controllers that can generate outer-approximations of reachable sets.<n>We provide theoretical guarantees on the soundness of our method and demonstrate its effectiveness across benchmark control environments.
arXiv Detail & Related papers (2025-07-06T20:22:58Z) - Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents in five critical dimensions.<n>We identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making.<n>As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z) - Continuous Observability Assurance in Cloud-Native Applications [0.0]
We build on previous work and integrate our observability experiment tool OXN into a novel method for continuous observability assurance.<n>We demonstrate its use and discuss future directions.
arXiv Detail & Related papers (2025-03-11T15:43:26Z) - LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure Resilience [5.644170923282226]
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data.<n>We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year.<n>We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
arXiv Detail & Related papers (2025-01-28T06:41:37Z) - Profile of Vulnerability Remediations in Dependencies Using Graph
Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model.
We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities.
Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z) - Metrics reloaded: Recommendations for image analysis validation [59.60445111432934]
Metrics Reloaded is a comprehensive framework guiding researchers in the problem-aware selection of metrics.
The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint.
Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics.
arXiv Detail & Related papers (2022-06-03T15:56:51Z) - Benchmarking high-fidelity pedestrian tracking systems for research,
real-time monitoring and crowd control [55.41644538483948]
High-fidelity pedestrian tracking in real-life conditions has been an important tool in fundamental crowd dynamics research.
As this technology advances, it is becoming increasingly useful also in society.
To successfully employ pedestrian tracking techniques in research and technology, it is crucial to validate and benchmark them for accuracy.
We present and discuss a benchmark suite, towards an open standard in the community, for privacy-respectful pedestrian tracking techniques.
arXiv Detail & Related papers (2021-08-26T11:45:26Z) - Towards AIOps in Edge Computing Environments [60.27785717687999]
This paper describes the system design of an AIOps platform which is applicable in heterogeneous, distributed environments.
It is feasible to collect metrics with a high frequency and simultaneously run specific anomaly detection algorithms directly on edge devices.
arXiv Detail & Related papers (2021-02-12T09:33:00Z) - Building an Automated and Self-Aware Anomaly Detection System [0.0]
It can be challenging to proactively monitor a large number of diverse and constantly changing time series for anomalies.
Traditionally, variations in the data generation processes and patterns have required strong modeling expertise to create models that accurately flag anomalies.
In this paper, we describe an anomaly detection system that overcomes this common challenge by keeping track of its own performance and making changes as necessary to each model without requiring manual intervention.
arXiv Detail & Related papers (2020-11-10T11:19:07Z) - Monitoring and explainability of models in production [58.720142291102135]
Monitoring deployed models is crucial for continued provision of high quality machine learning enabled services.
We discuss the challenges to successful implementation of solutions in each of these areas with some recent examples of production ready solutions using open source tools.
arXiv Detail & Related papers (2020-07-13T10:37:05Z) - Collaborative Inference for Efficient Remote Monitoring [34.27630312942825]
A naive approach to resolve this on the model level is to use simpler architectures.
We propose an alternative solution by decomposing the predictive model as the sum of a simple function which serves as a local monitoring tool.
A sign requirement is imposed on the latter to ensure that the local monitoring function is safe.
arXiv Detail & Related papers (2020-02-12T01:57:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.