Related papers: Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results

URL: http://arxiv.org/abs/2503.13520v1
Date: Fri, 14 Mar 2025 18:52:18 GMT
Title: Evaluating the Process Modeling Abilities of Large Language Models -- Preliminary Foundations and Results
Authors: Peter Fettke, Constantin Houy
Abstract summary: Large language models (LLM) have revolutionized the processing of natural language. It is currently under debate to what extent an LLM can generate good process models. We discuss these challenges in detail and discuss future experiments to tackle these challenges scientifically.
Score: 1.3812010983144802
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large language models (LLM) have revolutionized the processing of natural language. Although first benchmarks of the process modeling abilities of LLM are promising, it is currently under debate to what extent an LLM can generate good process models. In this contribution, we argue that the evaluation of the process modeling abilities of LLM is far from being trivial. Hence, available evaluation results must be taken carefully. For example, even in a simple scenario, not only the quality of a model should be taken into account, but also the costs and time needed for generation. Thus, an LLM does not generate one optimal solution, but a set of Pareto-optimal variants. Moreover, there are several further challenges which have to be taken into account, e.g. conceptualization of quality, validation of results, generalizability, and data leakage. We discuss these challenges in detail and discuss future experiments to tackle these challenges scientifically.
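
The abstract's point about Pareto-optimal variants can be illustrated with a minimal sketch. It assumes each generated process model is scored on three criteria (quality, cost, and generation time); the `Variant` type, the function names, and all numbers are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    name: str        # hypothetical label for one generated process model
    quality: float   # higher is better (illustrative score)
    cost: float      # lower is better (e.g. generation cost)
    seconds: float   # lower is better (generation time)


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is no worse than b on every criterion and strictly better on at least one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.seconds <= b.seconds
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.seconds < b.seconds
    return no_worse and strictly_better


def pareto_front(variants: list[Variant]) -> list[Variant]:
    """Keep only the variants that no other variant dominates."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


# Made-up numbers, purely for illustration.
candidates = [
    Variant("A", quality=0.90, cost=0.50, seconds=30.0),
    Variant("B", quality=0.80, cost=0.10, seconds=5.0),
    Variant("C", quality=0.75, cost=0.20, seconds=8.0),   # dominated by B
    Variant("D", quality=0.90, cost=0.60, seconds=35.0),  # dominated by A
]
front = pareto_front(candidates)
print([v.name for v in front])
```

In this toy data, neither A nor B dominates the other (A has higher quality, B is cheaper and faster), so both remain on the front; which one is preferable depends on how the criteria are weighted.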

Related papers

Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Efficient Model Selection for Time Series Forecasting via LLMs [52.31535714387368]
We propose to leverage Large Language Models (LLMs) as a lightweight alternative for model selection. Our method eliminates the need for explicit performance matrices by utilizing the inherent knowledge and reasoning capabilities of LLMs.
arXiv Detail & Related papers (2025-04-02T20:33:27Z)
ModiGen: A Large Language Model-Based Workflow for Multi-Task Modelica Code Generation [26.965467452327445]
Large language models (LLMs) have demonstrated promising capabilities in code generation, but their application to modeling remains largely unexplored. Our evaluation reveals substantial limitations in current LLMs, as the generated code often fails to simulate successfully. We propose a specialized workflow that integrates supervised fine-tuning, graph retrieval-augmented generation, and feedback optimization to improve the accuracy and reliability of Modelica code generation.
arXiv Detail & Related papers (2025-03-24T09:04:49Z)
Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators. It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs' decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Zero-shot LLM-guided Counterfactual Generation: A Case Study on NLP Model Evaluation [15.254775341371364]
We explore the possibility of leveraging large language models for zero-shot counterfactual generation. We propose a structured pipeline to facilitate this generation, and we hypothesize that the instruction-following and textual understanding capabilities of recent LLMs can be effectively leveraged.
arXiv Detail & Related papers (2024-05-08T03:57:45Z)
LLMs May Perform MCQA by Selecting the Least Incorrect Option [29.202758753639078]
Large Language Models (LLMs) have markedly enhanced performance across a variety of tasks. The adoption of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs has gained considerable traction. However, concerns regarding the robustness of this evaluative method persist.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions. Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions. To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs [6.689848416609951]
We study the interplay between unlearning and fairness for large language models (LLMs). We focus on a popular unlearning framework known as SISA, which creates an ensemble of models trained on disjoint shards. We propose post-processing bias mitigation techniques for ensemble models produced by SISA.
arXiv Detail & Related papers (2023-12-12T16:44:47Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains. In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks. We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.