Lumos: A Library for Diagnosing Metric Regressions in Web-Scale
Applications
- URL: http://arxiv.org/abs/2006.12793v1
- Date: Tue, 23 Jun 2020 07:02:07 GMT
- Title: Lumos: A Library for Diagnosing Metric Regressions in Web-Scale
Applications
- Authors: Jamie Pool, Ebrahim Beyrami, Vishak Gopal, Ashkan Aazami, Jayant
Gupchup, Jeff Rowland, Binlong Li, Pritesh Kanani, Ross Cutler, and Johannes
Gehrke
- Abstract summary: We present Lumos, a Python library built using the principles of AB testing to systematically diagnose metric regressions.
Lumos has been deployed across the component teams in Microsoft's Real-Time Communication applications Skype and Microsoft Teams.
It has enabled engineering teams to detect 100s of real changes in metrics and reject 1000s of false alarms detected by anomaly detectors.
- Score: 13.52733069152118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web-scale applications can ship code on a daily to weekly cadence. These
applications rely on online metrics to monitor the health of new releases.
Regressions in metric values need to be detected and diagnosed as early as
possible to reduce the disruption to users and product owners. Regressions in
metrics can surface due to a variety of reasons: genuine product regressions,
changes in user population, and bias due to telemetry loss (or processing) are
among the common causes. Diagnosing the cause of these metric regressions is
costly for engineering teams as they need to invest time in finding the root
cause of the issue as soon as possible. We present Lumos, a Python library
built using the principles of AB testing to systematically diagnose metric
regressions and automate such analysis. Lumos has been deployed across the
component teams in Microsoft's Real-Time Communication applications Skype and
Microsoft Teams. It has enabled engineering teams to detect 100s of real
changes in metrics and reject 1000s of false alarms detected by anomaly
detectors. The application of Lumos has resulted in freeing up as much as 95%
of the time allocated to metric-based investigations. In this work, we open
source Lumos and present our results from applying it to two different
components within the RTC group over millions of sessions. This general library
can be coupled with any production system to manage the volume of alerting
efficiently.
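Lumos itself is open source, but its API is not shown in this abstract. As a rough, hypothetical sketch of the kind of A/B-test-style comparison described above, the Python snippet below contrasts a rate metric between a control sample and a treatment sample and checks whether the user-population mix shifted between them. The function names, numbers, and the plain two-proportion z-test are illustrative assumptions, not Lumos's actual implementation.

```python
# Illustrative sketch only (hypothetical names; not Lumos's actual API).
# Compares a rate metric between control and treatment samples (A/B style)
# and inspects population-mix shift, one common non-product cause of
# metric regressions mentioned in the abstract.
import math
from collections import Counter

def two_proportion_ztest(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference in a rate metric (e.g. failed calls)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    p_pool = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

def population_shift(control_segments, treatment_segments):
    """Difference in the share of each user segment (platform, network, ...)
    between the two samples; a large shift suggests a population change rather
    than a genuine product regression."""
    c, t = Counter(control_segments), Counter(treatment_segments)
    n_c, n_t = sum(c.values()), sum(t.values())
    return {k: t.get(k, 0) / n_t - c.get(k, 0) / n_c for k in set(c) | set(t)}

# Hypothetical usage: 2.1% failure rate in control vs 2.6% in treatment.
delta, p = two_proportion_ztest(events_a=2100, n_a=100_000,
                                events_b=2600, n_b=100_000)
print(f"metric delta={delta:.4f}, p-value={p:.2e}")
```

In practice the paper describes additional steps (such as accounting for population and telemetry bias before attributing a regression to the product), so this sketch only mirrors the high-level idea of the comparison.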
Related papers
- AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators [57.003100107659684]
AutoMetrics is a framework for synthesizing evaluation metrics under low-data constraints. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward.
arXiv Detail & Related papers (2025-12-19T06:32:46Z)
- Position: All Current Generative Fidelity and Diversity Metrics are Flawed [58.815519650465774]
We show that all current generative fidelity and diversity metrics are flawed. Our aim is to convince the research community to spend more effort in developing metrics, instead of models.
arXiv Detail & Related papers (2025-05-28T15:10:33Z)
- Automatically Detecting Numerical Instability in Machine Learning Applications via Soft Assertions [7.893728124841138]
Numerical bugs can lead to system crashes, incorrect output, and wasted computing resources.
We introduce a novel idea, namely soft assertions (SA), to encode safety/error conditions for the places where numerical instability can occur.
arXiv Detail & Related papers (2025-04-22T00:55:33Z)
- Reinforcement Learning for Long-Horizon Interactive LLM Agents [56.9860859585028]
Interactive digital agents (IDAs) leverage APIs of stateful digital environments to perform tasks in response to user requests.
We present a reinforcement learning (RL) approach that trains IDAs directly in their target environments.
We derive LOOP, a data- and memory-efficient variant of proximal policy optimization.
arXiv Detail & Related papers (2025-02-03T18:35:42Z)
- Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly.
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.
Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
- A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.
The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.
We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
- Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? [135.71855998537347]
We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy.
We show that this metric is unreliable, as even vacuous blind classifiers can achieve unrealistically high online accuracy.
Existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information.
arXiv Detail & Related papers (2023-05-16T08:29:33Z)
- CADeSH: Collaborative Anomaly Detection for Smart Homes [17.072108188004396]
We propose a two-step collaborative anomaly detection method.
It first uses an autoencoder to differentiate frequent ('benign') and infrequent ('possibly malicious') traffic flows.
Clustering is then used to analyze only the infrequent flows and classify them as either known ('rare yet benign') or unknown ('malicious').
arXiv Detail & Related papers (2023-03-02T07:22:26Z)
- CMMD: Cross-Metric Multi-Dimensional Root Cause Analysis [17.755405467437637]
In large-scale online services, crucial metrics, a.k.a. key performance indicators (KPIs), are monitored periodically to check their running status.
Once abnormal values are observed, root cause analysis (RCA) can be applied to identify the reasons for anomalies.
We propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components.
arXiv Detail & Related papers (2022-03-30T13:17:19Z)
- Using sequential drift detection to test the API economy [4.056434158960926]
The API economy refers to the widespread integration of APIs (application programming interfaces).
It is desirable to monitor usage patterns and identify when the system is being used in a way it has never been used before.
In this work we analyze both histograms and the call graph of API usage to determine whether the system's usage patterns have shifted.
arXiv Detail & Related papers (2021-11-09T13:24:19Z)
- Automated User Experience Testing through Multi-Dimensional Performance Impact Analysis [0.0]
We propose a novel automated user experience testing methodology.
It learns how code changes impact the time that unit and system tests take, and extrapolates user experience changes based on this information.
Our open-source tool achieved 3.7% mean absolute error rate with a random forest regressor.
arXiv Detail & Related papers (2021-04-08T01:18:01Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied to IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
- Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historical observations.
It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z)
- Learning to Evaluate Perception Models Using Planner-Centric Metrics [104.33349410009161]
We propose a principled metric for 3D object detection specifically for the task of self-driving.
We find that our metric penalizes many of the mistakes that other metrics penalize by design.
For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time.
arXiv Detail & Related papers (2020-04-19T02:14:00Z)