Root Causing Prediction Anomalies Using Explainable AI
- URL: http://arxiv.org/abs/2403.02439v1
- Date: Mon, 4 Mar 2024 19:38:50 GMT
- Title: Root Causing Prediction Anomalies Using Explainable AI
- Authors: Ramanathan Vishnampet, Rajesh Shenoy, Jianhui Chen, Anuj Gupta
- Abstract summary: We present a novel application of explainable AI (XAI) for root-causing performance degradation in machine learning models.
A single feature corruption can cause cascading feature, label and concept drifts.
We have successfully applied this technique to improve the reliability of models used in personalized advertising.
- Score: 3.970146574042422
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a novel application of explainable AI (XAI) for
root-causing performance degradation in machine learning models that learn
continuously from user engagement data. In such systems, a single feature
corruption can cause cascading feature, label and concept drifts. We have
successfully applied this technique to improve the reliability of models used
in personalized advertising. Performance degradation in such systems manifests
as prediction anomalies in the models. These models are typically trained
continuously using features that are produced by hundreds of real time data
processing pipelines or derived from other upstream models. A failure in any of
these pipelines or an instability in any of the upstream models can cause
feature corruption, causing the model's predicted output to deviate from the
actual output and the training data to become corrupted. The causal
relationship between the features and the predicted output is complex, and
root-causing is challenging due to the scale and dynamism of the system. We
demonstrate how temporal shifts in the global feature importance distribution
can effectively isolate the cause of a prediction anomaly, with better recall
than model-to-feature correlation methods. The technique appears to be
effective even when approximating the local feature importance using a simple
perturbation-based method, and aggregating over a few thousand examples. We
have found this technique to be a model-agnostic, cheap and effective way to
monitor complex data pipelines in production and have deployed a system for
continuously analyzing the global feature importance distribution of
continuously trained models.
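To make the method concrete, the following is a minimal sketch of the pipeline the abstract describes: approximate local feature importance with a simple perturbation (here, per-feature permutation), aggregate over a batch of a few thousand examples into a global importance distribution, and compare that distribution across time windows to rank root-cause candidates. The `model.predict` interface, the permutation-style perturbation, and the symmetrized-KL shift measure are illustrative assumptions, not the authors' exact estimator.

```python
import numpy as np

def local_importance(model, X):
    """Approximate local feature importance with a simple perturbation:
    permute one feature at a time and measure the change in prediction.
    `model.predict` is an assumed scoring interface (hypothetical)."""
    base = model.predict(X)
    imp = np.empty(X.shape)
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = np.random.permutation(Xp[:, j])  # corrupt feature j only
        imp[:, j] = np.abs(model.predict(Xp) - base)
    return imp

def global_importance(model, X):
    """Aggregate local importances over a batch (a few thousand examples,
    per the abstract) and normalize into a distribution over features."""
    mean_imp = local_importance(model, X).mean(axis=0)
    return mean_imp / mean_imp.sum()

def importance_shift(p_baseline, p_current, eps=1e-12):
    """Per-feature contribution to the temporal shift between two global
    importance distributions; the largest entries are root-cause candidates."""
    p, q = p_baseline + eps, p_current + eps
    return p * np.log(p / q) + q * np.log(q / p)  # symmetrized KL, per feature

# Usage: compare a healthy window against the anomalous one and rank features.
# shift = importance_shift(global_importance(model, X_healthy),
#                          global_importance(model, X_anomalous))
# suspects = np.argsort(shift)[::-1][:10]
```

Because the comparison runs on aggregated importance distributions rather than raw predictions, this style of monitor stays model-agnostic and cheap enough to run continuously, in line with the abstract's deployment claim.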
Related papers
- Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias [47.79659355705916]
Model-induced distribution shifts (MIDS) occur as previous model outputs pollute new model training sets over generations of models.
We introduce a framework that allows us to track multiple MIDS over many generations, finding that they can lead to loss in performance, fairness, and minoritized group representation.
Despite these negative consequences, we identify how models might be used for positive, intentional, interventions in their data ecosystems.
arXiv Detail & Related papers (2024-03-12T17:48:08Z)
- Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization.
We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data.
We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
- Towards Continually Learning Application Performance Models [1.2278517240988065]
Machine learning-based performance models are increasingly being used to make critical job scheduling and application optimization decisions.
Traditionally, these models assume that the data distribution does not change as more samples are collected over time.
We develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability.
arXiv Detail & Related papers (2023-10-25T20:48:46Z)
- A hybrid feature learning approach based on convolutional kernels for ATM fault prediction using event-log data [5.859431341476405]
We present a predictive model based on convolutional kernels (MiniROCKET and HYDRA) to extract features from event-log data.
The proposed methodology is applied to a sizeable real-world dataset.
The model was integrated into a container-based decision support system to support operators in the timely maintenance of ATMs.
arXiv Detail & Related papers (2023-05-17T08:55:53Z)
- Pulling Up by the Causal Bootstraps: Causal Data Augmentation for Pre-training Debiasing [14.4304416146106]
We study and extend a causal pre-training debiasing technique called causal bootstrapping.
We demonstrate that such a causal pre-training technique can significantly outperform existing base practices.
arXiv Detail & Related papers (2021-08-27T21:42:04Z)
- A Framework for Machine Learning of Model Error in Dynamical Systems [7.384376731453594]
We present a unifying framework for blending mechanistic and machine-learning approaches to identify dynamical systems from data.
We cast the problem in both continuous and discrete time, covering settings in which the model error is memoryless and settings in which it has significant memory.
We find that hybrid methods substantially outperform solely data-driven approaches in terms of data hunger, demands for model complexity, and overall predictive performance.
arXiv Detail & Related papers (2021-07-14T12:47:48Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines (a minimal sketch of the distillation objective appears after this list).
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Learning Causal Models Online [103.87959747047158]
Predictive models can rely on spurious correlations in the data for making predictions.
One solution for achieving strong generalization is to incorporate causal structures in the models.
We propose an online algorithm that continually detects and removes spurious features.
arXiv Detail & Related papers (2020-06-12T20:49:20Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
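As referenced in the Churn Reduction via Distillation entry above, here is a minimal sketch of that objective, assuming a standard softmax-distillation loss with the previous (base) model as the teacher; the mixing weight `lam` and temperature `T` are hypothetical hyperparameters, not values from the paper, and this is a generic rendering of the technique rather than the authors' exact training setup.

```python
import torch.nn.functional as F

def churn_distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=1.0):
    """Cross-entropy on the labels plus a distillation term that pulls the
    new (student) model toward the base (teacher) model's predictions,
    implicitly constraining predictive churn between model versions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - lam) * ce + lam * kd
```

Anchoring the student's soft predictions to the base model's output penalizes prediction flips on examples the base model already handled, which is precisely the churn being constrained.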
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.