Related papers: Evaluating LLM-Based Process Explanations under Progressive Behavioral-Input Reduction

URL: http://arxiv.org/abs/2510.09732v1
Date: Fri, 10 Oct 2025 13:10:50 GMT
Title: Evaluating LLM-Based Process Explanations under Progressive Behavioral-Input Reduction
Authors: P. van Oerle, R. H. Bemthuis, F. A. Bukhsh
Abstract summary: Large Language Models (LLMs) are increasingly used to generate explanations of process models discovered from event logs. This paper reports an evaluation of explanation quality under progressive behavioral-input reduction. On synthetic logs, explanation quality is largely preserved under moderate reduction, indicating a practical cost-quality trade-off.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used to generate textual explanations of process models discovered from event logs. Producing explanations from large behavioral abstractions (e.g., directly-follows graphs or Petri nets) can be computationally expensive. This paper reports an exploratory evaluation of explanation quality under progressive behavioral-input reduction, where models are discovered from progressively smaller prefixes of a fixed log. Our pipeline (i) discovers models at multiple input sizes, (ii) prompts an LLM to generate explanations, and (iii) uses a second LLM to assess completeness, bottleneck identification, and suggested improvements. On synthetic logs, explanation quality is largely preserved under moderate reduction, indicating a practical cost-quality trade-off. The study is exploratory, as the scores are LLM-based (comparative signals rather than ground truth) and the data are synthetic. The results suggest a path toward more computationally efficient, LLM-assisted process analysis in resource-constrained settings.
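
As a rough illustration of step (i) of the pipeline described in the abstract, the sketch below builds directly-follows graphs from progressively smaller prefixes of a fixed log. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy log, the reduction levels, and the helper names `prefix` and `directly_follows` are hypothetical, and the prefix is assumed to be taken over traces (the abstract does not say whether it is taken over traces or events). The LLM explanation and assessment steps are omitted.

```python
from collections import Counter


def prefix(log, fraction):
    """Return the first `fraction` of the traces in `log` (at least one trace).

    Assumption: the reduction is applied at trace level.
    """
    keep = max(1, int(len(log) * fraction))
    return log[:keep]


def directly_follows(log):
    """Count how often activity a is immediately followed by activity b."""
    dfg = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


# Hypothetical toy log: each trace is a list of activity names.
toy_log = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "check", "approve"],
    ["register", "approve"],
]

# Hypothetical reduction levels: full log, half of it, a quarter of it.
models = {f: directly_follows(prefix(toy_log, f)) for f in (1.0, 0.5, 0.25)}

for fraction, dfg in models.items():
    print(fraction, len(dfg), "distinct directly-follows pairs")
```

In this toy case the smaller prefixes lose some pairs (for example the direct register-to-approve shortcut), which is the kind of information loss the explanation-quality evaluation would need to be sensitive to.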

Related papers

Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z)
Behavior and Representation in Large Language Models for Combinatorial Optimization: From Feature Extraction to Algorithm Selection [2.6285579209051284]
Large Language Models (LLMs) have opened new perspectives for automation in optimization. This study investigates how LLMs internally represent optimization problems and whether such representations can support downstream decision tasks.
arXiv Detail & Related papers (2025-12-15T14:28:35Z)
R-Log: Incentivizing Log Analysis Capability in LLMs via Reasoning-based Reinforcement Learning [19.713020881817588]
R-Log is a novel reasoning-based paradigm that mirrors the structured, step-by-step analytical process of human engineers. R-Log is first cold-started on a curated dataset of 2k+ reasoning trajectories, guided by 13 strategies from manual O&M practices. Empirical evaluations on real-world logs show that R-Log outperforms existing methods across five log analysis tasks.
arXiv Detail & Related papers (2025-09-30T09:19:31Z)
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. They are often distracted by irrelevant or noisy context in input sequences that degrades output quality. We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z)
Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG). We compute a factuality score that can be thresholded to yield a binary decision. Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation [34.37154877681809]
This work proposes augmenting large language models (LLMs) with predictor networks trained to estimate circuit quality directly from HDL code. To enhance performance, the model is regularized using embeddings from graph neural networks (GNNs) trained on Look-Up Table (LUT) graphs. The proposed method demonstrates superior performance compared to existing graph-based RTL-level estimation techniques on the established benchmark OpenABCD.
arXiv Detail & Related papers (2024-10-30T04:20:10Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery [5.2387832710686695]
In this work, we introduce the first comprehensive framework that utilizes Large Language Models (LLMs) for the task of Symbolic Regression. We propose In-Context Symbolic Regression (ICSR), an SR method which iteratively refines a functional form with an LLM and determines its coefficients with an external optimizer. Our findings reveal that LLMs are able to successfully find symbolic equations that fit the given data, matching or outperforming the overall performance of the best SR baselines on four popular benchmarks.
arXiv Detail & Related papers (2024-04-29T20:19:25Z)
Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
Factuality is a critical concern for Large Language Models (LLMs). We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset. The test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts.
arXiv Detail & Related papers (2024-04-01T06:01:17Z)
How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities. We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) are leveraging human feedback to improve their generation quality. We propose LLMRefine, an inference-time optimization method to refine an LLM's output. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Direct loss minimization algorithms for sparse Gaussian processes [9.041035455989181]
The paper provides a thorough investigation of direct loss minimization (DLM), which optimizes the posterior to minimize predictive loss in sparse Gaussian processes. The application of DLM in non-conjugate cases is more complex because the minimization of expectation in the log-loss DLM objective is often intractable.
arXiv Detail & Related papers (2020-04-07T02:31:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.