Related papers: Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

URL: http://arxiv.org/abs/2009.03257v1
Date: Mon, 7 Sep 2020 17:26:18 GMT
Title: Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction
Authors: Carl Martin Rosenberg and Leon Moonen
Abstract summary: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively. We ask: Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? We also ask: How does the criterion used for merging clusters in the clustering algorithm affect quality?
Score: 0.8122270502556374
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Goal: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively, and investigate the following questions: (1) Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? (2) How does dimensionality reduction affect the quality of automated log clustering? (3) How does the criterion used for merging clusters in the clustering algorithm affect clustering quality? Method: We replicate and extend earlier work on clustering system log files to assess its generalization to continuous deployment logs. We consider the optional inclusion of one of these dimensionality reduction techniques: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF). Moreover, we consider three alternative cluster merge criteria (Single Linkage, Average Linkage, and Weighted Linkage), in addition to the Complete Linkage criterion used in earlier work. We empirically evaluate the 16 resulting configurations on continuous deployment logs provided by our industrial collaborator. Results: Our study shows that (1) identifying problems in continuous deployment logs via clustering is feasible, (2) including NMF significantly improves overall accuracy and robustness, and (3) Complete Linkage performs best of all merge criteria analyzed. Conclusions: We conclude that problem identification via automated log clustering is improved by including dimensionality reduction, as it decreases the pipeline's sensitivity to parameter choice, thereby increasing its robustness for handling different inputs.

Related papers

CA-AFP: Cluster-Aware Adaptive Federated Pruning [1.345821655503426]
Federated Learning (FL) faces major challenges in real-world deployments due to statistical heterogeneity across clients.<n>We propose CA-AFP, a unified framework that jointly addresses both challenges by performing cluster-specific model pruning.<n>We evaluate CA-AFP on two widely used human activity recognition benchmarks, UCI HAR and WISDM, under natural user-based federated partitions.
arXiv Detail & Related papers (2026-03-02T11:04:25Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
Exact and Heuristic Algorithms for Constrained Biclustering [0.0]
Biclustering, also known as co-clustering or two-way clustering, simultaneously partitions the rows and columns of a data matrix to reveal submatrices with coherent patterns.<n>We study constrained biclustering with pairwise constraints, namely must-link and cannot-link constraints, which specify whether objects should belong to the same or different biclusters.
arXiv Detail & Related papers (2025-08-07T15:29:22Z)
Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses the limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval.<n>The innovative framework first create Semantic Partitioning of Linked Information, then use the Type-Specialized knowledge base to achieve Multi-Agent RAG.<n>The attribute-aware graph segmentation manages to divide knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types.<n>A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications.
arXiv Detail & Related papers (2025-05-20T06:44:34Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN.<n>We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but also is capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
AnomalyGen: An Automated Semantic Log Sequence Generation Framework with LLM for Anomaly Detection [25.83270938475311]
AnomalyGen is the first automated log synthesis framework specifically designed for anomaly detection. Our framework integrates enhanced program analysis with Chain-of-Thought reasoning (CoT reasoning) to enable iterative log generation and anomaly annotation. When augmenting benchmark datasets with synthesized logs, we observe maximum F1-score improvements of 3.7%.
arXiv Detail & Related papers (2025-04-16T16:54:38Z)
Consistent Amortized Clustering via Generative Flow Networks [6.9158153233702935]
We introduce GFNCP, a novel framework for amortized clustering. GFNCP is formulated as a Generative Flow Network with a shared energy-based parametrization of policy and reward. We show that the flow matching conditions are equivalent to consistency of the clustering posterior under marginalization.
arXiv Detail & Related papers (2025-02-26T17:30:52Z)
Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks. We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
A3S: A General Active Clustering Method with Pairwise Constraints [66.74627463101837]
A3S features strategic active clustering adjustment on the initial cluster result, which is obtained by an adaptive clustering algorithm. In extensive experiments across diverse real-world datasets, A3S achieves desired results with significantly fewer human queries.
arXiv Detail & Related papers (2024-07-14T13:37:03Z)
Towards a connection between the capacitated vehicle routing problem and the constrained centroid-based clustering [1.3927943269211591]
Efficiently solving a vehicle routing problem in a practical runtime is a critical challenge for delivery management companies. This paper explores both a theoretical and experimental connection between the Capacitated Vehicle Problem (CVRP) and the Constrainedid-Based Clustering (CCBC) The proposed framework consists of three stages. At the first step, a constrained centroid-based clustering algorithm generates feasible clusters of customers.
arXiv Detail & Related papers (2024-03-20T22:24:36Z)
Sample-Efficient Clustering and Conquer Procedures for Parallel Large-Scale Ranking and Selection [0.0]
In parallel computing environments, correlation-based clustering can achieve an $mathcalO(p)$ sample complexity reduction rate. In large-scale AI applications such as neural architecture search, a screening-free version of our procedure surprisingly surpasses fully-sequential benchmarks in terms of sample efficiency.
arXiv Detail & Related papers (2024-02-03T15:56:03Z)
Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
Method for unsupervised meta-learning, CACTUs, is a clustering-based approach with pseudo-labeling. This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data. We prove that the core reason for this is lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
Near-Optimal Correlation Clustering with Privacy [37.94795032297396]
Correlation clustering is a central problem in unsupervised learning. In this paper, we introduce a simple and computationally efficient algorithm for the correlation clustering problem with provable privacy guarantees.
arXiv Detail & Related papers (2022-03-02T22:30:19Z)
Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL) MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training. Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
Applying Semi-Automated Hyperparameter Tuning for Clustering Algorithms [0.0]
This study proposes a framework for semi-automated hyperparameter tuning of clustering problems. It uses a grid search to develop a series of graphs and easy to interpret metrics that can then be used for more efficient domain-specific evaluation. Preliminary results show that internal metrics are unable to capture the semantic quality of the clusters developed.
arXiv Detail & Related papers (2021-08-25T05:48:06Z)
Rethinking Graph Autoencoder Models for Attributed Graph Clustering [1.2158275183241178]
Graph Auto-Encoders (GAEs) have been used to perform joint clustering and embedding learning. We study the accumulative error, inflicted by learning with noisy clustering assignments, and reconstructing the adjacency matrix. We propose a sampling operator $Xi$ that triggers a protection mechanism against the noisy clustering assignments.
arXiv Detail & Related papers (2021-07-19T00:00:35Z)
Channel DropBlock: An Improved Regularization Method for Fine-Grained Visual Classification [58.07257910065007]
Existing approaches mainly tackle this problem by introducing attention mechanisms to locate the discriminative parts or feature encoding approaches to extract the highly parameterized features in a weakly-supervised fashion. In this work, we propose a lightweight yet effective regularization method named Channel DropBlock (CDB) in combination with two alternative correlation metrics, to address this problem.
arXiv Detail & Related papers (2021-06-07T09:03:02Z)
Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) is to learn classification models that make predictions for unlabeled data on a target domain. We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one. Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.