Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study
- URL: http://arxiv.org/abs/2307.05950v2
- Date: Mon, 1 Apr 2024 12:19:55 GMT
- Title: Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study
- Authors: Yichen Li, Yintong Huo, Zhihan Jiang, Renyi Zhong, Pinjia He, Yuxin Su, Lionel Briand, Michael R. Lyu,
- Abstract summary: This paper performs the first study on exploring large language models (LLMs) for logging statement generation.
We first build a logging statement generation dataset, LogBench, with two parts: (1) LogBench-O: logging statements collected from GitHub repositories, and (2) LogBench-T: the transformed unseen code from LogBench-O.
- Score: 32.53659676826846
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automated logging statement generation supports developers in documenting critical software runtime behavior. Given the great success in natural language generation and programming language comprehension, large language models (LLMs) might help developers generate logging statements, but this has not yet been investigated. To fill the gap, this paper performs the first study on exploring LLMs for logging statement generation.We first build a logging statement generation dataset, LogBench, with two parts: (1) LogBench-O: logging statements collected from GitHub repositories, and (2) LogBench-T: the transformed unseen code from LogBench-O. Then, we leverage LogBench to evaluate the effectiveness and generalization capabilities (using LogBench-T) of eleven top-performing LLMs. In addition, we examine the performance of these LLMs against classical retrieval-based and machine learning-based logging methods from the era preceding LLMs. We further evaluate LLM's logging generalization capabilities using unseen data (LogBench-T) derived from code transformation techniques. While existing LLMs deliver decent predictions on logging levels and logging variables, our study indicates that they only achieve a maximum BLEU score of 0.249, thus calling for improvements. The paper also highlights the importance of prompt constructions and external factors (e.g., programming contexts and code comments) for LLMs' logging performance. Based on these findings, we identify five implications and provide practical advice for future logging research. Our empirical analysis discloses the limitations of current logging approaches while showcasing the potential of LLM-based logging tools, and provides actionable guidance for building more practical models.
Related papers
- RuAG: Learned-rule-augmented Generation for Large Language Models [62.64389390179651]
We propose a novel framework, RuAG, to automatically distill large volumes of offline data into interpretable first-order logic rules.
We evaluate our framework on public and private industrial tasks, including natural language processing, time-series, decision-making, and industrial tasks.
arXiv Detail & Related papers (2024-11-04T00:01:34Z) - Studying and Benchmarking Large Language Models For Log Level Suggestion [49.176736212364496]
Large Language Models (LLMs) have become a focal point of research across various domains.
This paper investigates the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion.
arXiv Detail & Related papers (2024-10-11T03:52:17Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - LUNAR: Unsupervised LLM-based Log Parsing [34.344687402936835]
We propose LUNAR, an unsupervised-based method for efficient and off-the-shelf log parsing.
Our key insight is that while LLMs may struggle with direct log parsing, their performance can be significantly enhanced through comparative analysis.
Experiments on large-scale public datasets demonstrate that LUNAR significantly outperforms state-of-the-art log crafts in terms of accuracy and efficiency.
arXiv Detail & Related papers (2024-06-11T11:32:01Z) - Log Parsing with Self-Generated In-Context Learning and Self-Correction [15.93927602769091]
Despite a variety of log parsing methods that have been proposed, their performance on evolving log data remains unsatisfactory due to reliance on human-crafted rules or learning-based models with limited training data.
We propose Ada, an effective and adaptive log parsing framework using LLMs with self-generated in-context learning (SG-ICL) and self-correction.
arXiv Detail & Related papers (2024-06-05T15:31:43Z) - LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing [8.647406441990396]
We study the potential of using Large Language Models (LLMs) for log parsing and propose an LLM-based log based on generative inferences and few-shot tuning.
We find that smaller LLMs may be more effective than more complex LLMs; for instance where Flan-T5-base achieves comparable results as LLaMA-7B with a shorter time.
We also find that using LLMs pre-trained using logs from other systems does not always improve parsing accuracy.
arXiv Detail & Related papers (2024-04-27T20:34:29Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - LILAC: Log Parsing using LLMs with Adaptive Parsing Cache [38.04960745458878]
We propose LILAC, the first practical log parsing framework using large language models (LLMs) with adaptive parsing cache.
LLMs's lack of specialized log parsing capabilities currently hinders their accuracy in parsing.
We show LILAC outperforms state-of-the-art methods by 69.5% in terms of the average F1 score of template accuracy.
arXiv Detail & Related papers (2023-10-03T04:46:59Z) - Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z) - Self-Supervised Log Parsing [59.04636530383049]
Large-scale software systems generate massive volumes of semi-structured log records.
Existing approaches rely on log-specifics or manual rule extraction.
We propose NuLog that utilizes a self-supervised learning model and formulates the parsing task as masked language modeling.
arXiv Detail & Related papers (2020-03-17T19:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.