Related papers: On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations

URL: http://arxiv.org/abs/2405.03489v1
Date: Mon, 6 May 2024 14:01:05 GMT
Title: On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations
Authors: Xiaoxue Ma, Huiqi Zou, Jacky Keung, Pinjia He, Yishu Li, Xiao Yu, Federica Sarro
Abstract summary: Class imbalance is common in the public data used to train Log Anomaly Detection models. Mitigating class imbalance through data resampling has proven effective for other software engineering tasks. This study provides an in-depth analysis of the impact of diverse data resampling methods on existing deep learning-based Log Anomaly Detection (DLLAD) approaches.
Score: 10.931620604044486
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Numerous deep learning (DL)-based approaches have garnered considerable attention in the field of software Log Anomaly Detection (LAD). However, a practical challenge persists: class imbalance in the public data commonly used to train the DL models. This imbalance is characterized by a substantial disparity between the number of abnormal log sequences and the number of normal ones; for example, anomalies represent less than 1% of one of the most popular datasets. Previous research has indicated that existing DL-based LAD (DLLAD) approaches may exhibit unsatisfactory performance, particularly when confronted with datasets featuring severe class imbalances. Mitigating class imbalance through data resampling has proven effective for other software engineering tasks, but it has so far been unexplored for LAD. This study aims to fill this gap by providing an in-depth analysis of the impact of diverse data resampling methods on existing DLLAD approaches from two distinct perspectives. First, we assess the performance of these DLLAD approaches across three datasets and explore the impact of resampling ratios of normal to abnormal data on ten data resampling methods. Second, we evaluate the effectiveness of the data resampling methods when utilizing optimal resampling ratios of normal to abnormal data. Our findings indicate that oversampling methods generally outperform undersampling and hybrid methods. Data resampling on raw data yields superior results compared to data resampling in the feature space. In most cases, certain undersampling and hybrid methods show limited effectiveness. Additionally, by exploring the resampling ratio of normal to abnormal data, we suggest generating more data for minority classes through oversampling while removing less data from majority classes through undersampling. In conclusion, our study provides valuable insights into the intricate relationship between data resampling methods and DLLAD.
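To make the idea of a normal-to-abnormal resampling ratio concrete, the following is a minimal sketch of plain random oversampling and random undersampling applied to raw log sequences. It is not taken from the paper: the function names, the toy data, and the way `ratio` is defined (target number of normal sequences per abnormal sequence) are all assumptions made for illustration, and the paper's ten resampling methods may define and apply ratios differently.

```python
import random
from collections import Counter


def _split(sequences, labels):
    """Separate sequences into normal (label 0) and abnormal (label 1)."""
    normal = [s for s, y in zip(sequences, labels) if y == 0]
    abnormal = [s for s, y in zip(sequences, labels) if y == 1]
    if not abnormal:
        raise ValueError("need at least one abnormal sequence")
    return normal, abnormal


def random_oversample(sequences, labels, ratio, seed=0):
    """Duplicate random abnormal sequences until normal:abnormal is about ratio:1.

    `ratio` is a hypothetical parameter: the desired number of normal
    sequences per abnormal sequence after resampling.
    """
    rng = random.Random(seed)
    normal, abnormal = _split(sequences, labels)
    # Never shrink the minority class; only add duplicates.
    target = max(len(abnormal), round(len(normal) / ratio))
    extra = [rng.choice(abnormal) for _ in range(target - len(abnormal))]
    data = [(s, 0) for s in normal] + [(s, 1) for s in abnormal + extra]
    rng.shuffle(data)
    return data


def random_undersample(sequences, labels, ratio, seed=0):
    """Drop random normal sequences until normal:abnormal is about ratio:1."""
    rng = random.Random(seed)
    normal, abnormal = _split(sequences, labels)
    # Never grow the majority class; only remove samples.
    keep = min(len(normal), round(len(abnormal) * ratio))
    data = [(s, 0) for s in rng.sample(normal, keep)] + [(s, 1) for s in abnormal]
    rng.shuffle(data)
    return data


# Toy data (made up): 995 normal and 5 abnormal sequences of log event IDs,
# so anomalies are under 1% of the data.
sequences = [["E1", "E2", "E3"]] * 995 + [["E1", "E9", "E3"]] * 5
labels = [0] * 995 + [1] * 5

over = random_oversample(sequences, labels, ratio=4)
under = random_undersample(sequences, labels, ratio=4)
over_counts = Counter(y for _, y in over)
under_counts = Counter(y for _, y in under)
```

In this sketch, oversampling keeps every normal sequence and only adds abnormal duplicates, while undersampling keeps every abnormal sequence and only discards normal ones. Both operate on the raw sequences rather than on extracted feature vectors.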

Related papers

Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation [26.312206159418903]
Unsupervised anomaly detection (UAD) plays an important role in modern data analytics. We present a novel UAD method by evaluating how much noise is in the data. We provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully.
arXiv Detail & Related papers (2024-12-16T05:35:58Z)
Deep evolving semi-supervised anomaly detection [14.027613461156864]
The aim of this paper is to formalise the task of continual semi-supervised anomaly detection (CSAD). The paper introduces a baseline model of a variational autoencoder (VAE) to work with semi-supervised data, along with a continual learning method of deep generative replay with outlier rejection.
arXiv Detail & Related papers (2024-12-01T15:48:37Z)
A Bilevel Optimization Framework for Imbalanced Data Classification [1.6385815610837167]
We propose a new undersampling approach that avoids the pitfalls of noise and overlap caused by synthetic data. Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss. Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it.
arXiv Detail & Related papers (2024-10-15T01:17:23Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
Wireless Channel Aware Data Augmentation Methods for Deep Learning-Based Indoor Localization [22.76179980847908]
We propose methods that utilize the domain knowledge about wireless propagation channels and devices. We show that in the low-data regime, localization accuracy increases up to 50%, matching non-augmented results in the high-data regime. The proposed methods may outperform the measurement-only high-data performance by up to 33% using only 1/4 of the amount of measured data.
arXiv Detail & Related papers (2024-08-12T19:01:49Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Leveraging Latent Diffusion Models for Training-Free In-Distribution Data Augmentation for Surface Defect Detection [9.784793380119806]
We introduce DIAG, a training-free Diffusion-based In-distribution Anomaly Generation pipeline for data augmentation. Unlike conventional image generation techniques, we implement a human-in-the-loop pipeline, where domain experts provide multimodal guidance to the model. We demonstrate the efficacy and versatility of DIAG with respect to state-of-the-art data augmentation approaches on the challenging KSDD2 dataset.
arXiv Detail & Related papers (2024-07-04T14:28:52Z)
Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss. Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
Causal Deep Reinforcement Learning Using Observational Data [11.790171301328158]
We propose two deconfounding methods in deep reinforcement learning (DRL). The methods first calculate the importance degree of different samples based on the causal inference technique, and then adjust the impact of different samples on the loss function. We prove the effectiveness of our deconfounding methods and validate them experimentally.
arXiv Detail & Related papers (2022-11-28T14:34:39Z)
Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals. We analyze the challenges these methods face, based on empirical experiment results. We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z)
Deep Stable Learning for Out-Of-Distribution Generalization [27.437046504902938]
Approaches based on deep neural networks have achieved striking performance when testing data and training data share similar distribution. Eliminating the impact of distribution shifts between training and testing data is crucial for building performance-promising deep models. We propose to address this problem by removing the dependencies between features via learning weights for training samples.
arXiv Detail & Related papers (2021-04-16T03:54:21Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution. We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator. Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting. We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.