Related papers: Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference

URL: http://arxiv.org/abs/2511.05978v1
Date: Sat, 08 Nov 2025 11:53:08 GMT
Title: Kunlun Anomaly Troubleshooter: Enabling Kernel-Level Anomaly Detection and Causal Reasoning for Large Model Distributed Inference
Authors: Yuyang Liu, Jingjing Cai, Jiayi Ren, Peng Zhou, Danyang Zhang, Yin Du, Shijian Li,
Abstract summary: Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. We introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI.
Score: 15.448826510384302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Anomaly troubleshooting for large model distributed inference (LMDI) remains a critical challenge. Resolving anomalies such as inference performance degradation or latency jitter in distributed systems demands significant manual effort from domain experts, resulting in extremely time-consuming diagnosis processes with relatively low accuracy. In this paper, we introduce Kunlun Anomaly Troubleshooter (KAT), the first anomaly troubleshooting framework tailored for LMDI. KAT addresses this problem through two core innovations. First, KAT exploits the synchronicity and consistency of GPU workers and leverages function trace data to precisely detect kernel-level anomalies and the associated hardware components at nanosecond resolution. Second, KAT integrates these detection results into a domain-adapted LLM, delivering systematic causal reasoning and natural language interpretation of complex anomaly symptoms. Evaluations conducted in the Alibaba Cloud Service production environment indicate that KAT achieves over 0.884 precision and 0.936 recall in anomaly detection, providing detailed anomaly insights that significantly narrow the diagnostic scope and improve both the efficiency and success rate of troubleshooting.
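Illustration: the abstract does not say how KAT turns worker synchronicity into a detector, so the following is only a minimal sketch of the general idea, not KAT's algorithm. It assumes each GPU worker reports one duration (in nanoseconds) for the same kernel invocation and flags workers that deviate strongly from the cross-worker median, using a modified z-score based on the median absolute deviation. The function name `flag_outlier_workers`, the 3.5 threshold, and the sample durations are all hypothetical.

```python
from statistics import median


def flag_outlier_workers(durations_ns, threshold=3.5):
    """Return indices of workers whose kernel duration is an outlier.

    durations_ns: one duration in nanoseconds per GPU worker, all measured
    for the same kernel invocation (hypothetical input format).
    threshold: cutoff on the modified z-score (assumed value, not from the paper).
    """
    med = median(durations_ns)
    # Median absolute deviation: a spread estimate that one slow worker cannot inflate.
    mad = median(abs(d - med) for d in durations_ns)
    if mad == 0:
        # At least half of the workers agree exactly; flag any that differ.
        return [i for i, d in enumerate(durations_ns) if d != med]
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [
        i
        for i, d in enumerate(durations_ns)
        if 0.6745 * abs(d - med) / mad > threshold
    ]


# Hypothetical trace: worker 4 takes about five times longer than its peers.
print(flag_outlier_workers([1000, 1010, 990, 1005, 5000, 995]))  # [4]
```

A median-based score is used here because synchronized workers are expected to behave alike, so a single straggler should not shift the baseline it is compared against.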

Related papers

CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [49.11819337853632]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data. We propose CLIPfusion, a method that leverages both discriminative and generative foundation models. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z)
Enhancing Web Service Anomaly Detection via Fine-grained Multi-modal Association and Frequency Domain Analysis [8.860339665670255]
Anomaly detection is crucial for ensuring the stability and reliability of web service systems. Existing anomaly detection methods use logs and metrics to detect anomalies. We propose a novel anomaly detection method named FFAD to address these two issues.
arXiv Detail & Related papers (2025-01-28T12:00:45Z)
Enhanced Fault Detection and Cause Identification Using Integrated Attention Mechanism [0.3749861135832073]
This study introduces a novel methodology for fault detection and cause identification within the Tennessee Eastman Process (TEP) by integrating a Bidirectional Long Short-Term Memory (BiLSTM) neural network with an Integrated Attention Mechanism (IAM). The IAM combines the strengths of scaled dot product attention, residual attention, and dynamic attention to capture intricate patterns and dependencies crucial for TEP fault detection. The BiLSTM network processes these features bidirectionally to capture long-range dependencies, and the IAM further refines the output, leading to improved fault detection results.
arXiv Detail & Related papers (2024-07-31T12:01:57Z)
Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection [1.0358639819750703]
In unsupervised anomaly detection (UAD) research, it is necessary to develop a computationally efficient and scalable solution. We revisit the reconstruction-by-inpainting approach and improve it by analyzing its strengths and weaknesses. We propose Feature Attenuation of Defective Representation (FADeR), which employs only two layers that attenuate feature information of anomaly reconstruction.
arXiv Detail & Related papers (2024-07-05T15:44:53Z)
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets. We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection [44.21198064126152]
We propose a novel anomaly detection framework named ImDiffusion. ImDiffusion combines time series imputation and diffusion models to achieve accurate and robust anomaly detection. We evaluate the performance of ImDiffusion via extensive experiments on benchmark datasets.
arXiv Detail & Related papers (2023-07-03T04:57:40Z)
The role of noise in denoising models for anomaly detection in medical images [62.0532151156057]
Pathological brain lesions exhibit diverse appearance in brain images. Unsupervised anomaly detection approaches have been proposed using only normal data for training. We show that optimization of the spatial resolution and magnitude of the noise improves the performance of different model training regimes.
arXiv Detail & Related papers (2023-01-19T21:39:38Z)
Are we certain it's anomalous? [57.729669157989235]
Anomaly detection in time series is a complex task since anomalies are rare due to highly non-linear temporal correlations. Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD). HypAD learns self-supervisedly to reconstruct the input signal.
arXiv Detail & Related papers (2022-11-16T21:31:39Z)
Causality-Based Multivariate Time Series Anomaly Detection [63.799474860969156]
We formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism to generate the multivariate data. We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications.
arXiv Detail & Related papers (2022-06-30T06:00:13Z)
HURRA! Human readable router anomaly detection [11.564082628014638]
HURRA aims to reduce the time spent by human operators in the process of network troubleshooting. It comprises two modules that are plugged after any anomaly detection algorithm.
arXiv Detail & Related papers (2021-07-23T08:38:29Z)
TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs). To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics. To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)
SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples. We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.