Assessing UML Models by ChatGPT: Implications for Education
- URL: http://arxiv.org/abs/2412.17200v1
- Date: Mon, 23 Dec 2024 00:28:33 GMT
- Title: Assessing UML Models by ChatGPT: Implications for Education
- Authors: Chong Wang, Beian Wang, Peng Liang, Jie Liang,
- Abstract summary: In software engineering (SE) research and practice, is well known as an essential modeling methodology.
Recent advancements in generative AI techniques, such as ChatGPT, have paved new ways to automate many SE tasks.
This paper aims to investigate the feasibility and effectiveness of ChatGPT in assessing the quality of UML models.
- Score: 9.11195766839205
- License:
- Abstract: In software engineering (SE) research and practice, UML is well known as an essential modeling methodology for requirements analysis and software modeling in both academia and industry. In particular, fundamental knowledge of UML modeling and practice in creating high-quality UML models are included in SE-relevant courses in the undergraduate programs of many universities. This leads to a time-consuming and labor-intensive task for educators to review and grade a large number of UML models created by the students. Recent advancements in generative AI techniques, such as ChatGPT, have paved new ways to automate many SE tasks. However, current research or tools seldom explore the capabilities of ChatGPT in evaluating the quality of UML models. This paper aims to investigate the feasibility and effectiveness of ChatGPT in assessing the quality of UML use case diagrams, class diagrams, and sequence diagrams. First, 11 evaluation criteria with grading details were proposed for these UML models. Next, a series of experiments were designed and conducted on 40 students' UML modeling reports to explore the performance of ChatGPT in evaluating and grading these UML diagrams. The research findings reveal that ChatGPT performed well in this assessment task: the scores it gave to the UML models are similar to those given by human experts. Three evaluation discrepancies between ChatGPT and human experts were also identified, and they vary across the evaluation criteria used for the different types of UML models.
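The assessment workflow described in the abstract (prompting ChatGPT with per-diagram evaluation criteria and collecting scores) can be illustrated with a minimal sketch. This is not the authors' pipeline: the rubric text, the PlantUML diagram, and the model name are illustrative assumptions, and the OpenAI chat completions client is used only as one possible way to call such a model.

```python
# Hypothetical sketch: ask an LLM to grade a UML class diagram against a rubric.
# The rubric, diagram, and model name are placeholders, not the paper's materials.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = """Grade the UML class diagram on a 0-10 scale for each criterion:
1. Correct use of class notation (attributes, operations, visibility).
2. Appropriate relationships (association, aggregation, inheritance) with multiplicities.
3. Consistency with the stated requirements.
Return one line per criterion in the form '<criterion>: <score> - <justification>'."""

diagram_plantuml = """@startuml
class Student {
  -id: int
  +enroll(c: Course)
}
class Course {
  -code: String
}
Student "0..*" -- "1..*" Course
@enduml"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "You are a software engineering instructor grading student UML models."},
        {"role": "user", "content": RUBRIC + "\n\nDiagram:\n" + diagram_plantuml},
    ],
    temperature=0,  # deterministic scoring makes comparison with human graders easier
)
print(response.choices[0].message.content)
```

In a study like the one above, the per-criterion scores returned by the model would then be compared with the grades assigned by human experts.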
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
OLMES is a documented, practical, open standard for reproducible language model evaluations.
It supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions and larger models that can utilize the original formulation.
OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts [9.517655899237413]
Since GPT-3, Large Language Models (LLMs) have caught the attention of researchers, practitioners, and educators in the field of software engineering.
This paper explores how LLMs can assist novice analysts in creating three types of typical models: use case models, class diagrams, and sequence diagrams.
arXiv Detail & Related papers (2024-04-27T00:38:20Z)
- Assessing GPT-4-Vision's Capabilities in UML-Based Code Generation [0.5789654849162464]
GPT-4-Vision is a state-of-the-art deep learning model.
It can transform Unified Modeling Language (UML) class diagrams into fully operating Java class files.
arXiv Detail & Related papers (2024-04-22T17:21:24Z)
- Model Generation with LLMs: From Requirements to UML Sequence Diagrams [9.114284818139069]
This paper investigates the capability of ChatGPT to generate a specific type of model, i.e., sequence diagrams, from NL requirements.
We examine the sequence diagrams generated by ChatGPT for 28 requirements documents of various types and from different domains.
Our results indicate that, although the models generally conform to the standard and exhibit a reasonable level of understandability, their completeness and correctness with respect to the specified requirements often present challenges.
arXiv Detail & Related papers (2024-04-09T15:07:25Z)
- BLESS: Benchmarking Large Language Models on Sentence Simplification [55.461555829492866]
We present BLESS, a performance benchmark of the most recent state-of-the-art large language models (LLMs) on the task of text simplification (TS).
We assess a total of 44 models, differing in size, architecture, pre-training methods, and accessibility, on three test sets from different domains (Wikipedia, news, and medical) under a few-shot setting.
Our evaluation indicates that the best LLMs, despite not being trained on TS, perform comparably with state-of-the-art TS baselines.
arXiv Detail & Related papers (2023-10-24T12:18:17Z)
- Large Language Models for In-Context Student Modeling: Synthesizing Student's Behavior in Visual Programming [29.65988680948297]
We explore the application of large language models (LLMs) for in-context student modeling in open-ended learning domains.
We introduce a novel framework, LLM for Student Synthesis (LLM-SS), that leverages LLMs to synthesize a student's behavior.
We instantiate several methods based on LLM-SS framework and evaluate them using an existing benchmark, StudentSyn, for student attempt synthesis in a visual programming domain.
arXiv Detail & Related papers (2023-10-15T12:56:13Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization [101.37439352091612]
We describe the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes.
We present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT.
arXiv Detail & Related papers (2022-12-22T19:56:09Z)
- Consistency of UML class, object and statechart diagrams using ontology reasoners [0.0]
We propose an automatic approach to analyze consistency and satisfiability of Unified Modeling Language models containing multiple class, object and statechart diagrams.
We describe how to translate models into OWL 2, and we present a tool chain implementing this translation that can be used with any standard-compliant modeling tool.
arXiv Detail & Related papers (2022-05-23T10:29:32Z)
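To make the idea in the last entry (encoding UML models as OWL 2 ontologies and delegating consistency checking to a reasoner) more concrete, here is a minimal sketch. It is not the paper's tool chain: the class names are invented, owlready2 is used as one possible OWL 2 API for Python, and a Java runtime is assumed to be available for the bundled HermiT reasoner.

```python
# Hypothetical sketch: encode a tiny UML class hierarchy as OWL 2 and check it.
# Class names are illustrative; this is not the tool chain described in the paper.
from owlready2 import Thing, AllDisjoint, get_ontology, sync_reasoner, default_world

onto = get_ontology("http://example.org/uml_model.owl")

with onto:
    class Account(Thing):
        pass

    class SavingsAccount(Account):
        pass

    class LoanAccount(Account):
        pass

    # UML constraint: savings and loan accounts are disjoint specializations.
    AllDisjoint([SavingsAccount, LoanAccount])

    # A class declared as a specialization of both disjoint classes
    # cannot have any instances, i.e. it is unsatisfiable.
    class HybridAccount(SavingsAccount, LoanAccount):
        pass

# Run the HermiT reasoner bundled with owlready2 (requires Java).
with onto:
    sync_reasoner()

unsatisfiable = list(default_world.inconsistent_classes())
print("Unsatisfiable classes:", unsatisfiable)  # expected to include HybridAccount
```

The same pattern extends to the multi-diagram setting the paper targets: the more UML constraints are translated into OWL axioms, the more contradictions the reasoner can surface automatically.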