Behaviour Driven Development Scenario Generation with Large Language Models
- URL: http://arxiv.org/abs/2603.04729v1
- Date: Thu, 05 Mar 2026 02:05:48 GMT
- Title: Behaviour Driven Development Scenario Generation with Large Language Models
- Authors: Amila Rathnayake, Mojtaba Shahin, Golnoush Abaei,
- Abstract summary: This paper presents an evaluation of three LLMs, GPT-4, Claude 3, and Gemini, for automated Behaviour-Driven Development scenario generation. We constructed a dataset of 500 user stories, requirement descriptions, and their corresponding BDD scenarios, drawn from four proprietary software products.
- Score: 3.255679497255447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an evaluation of three LLMs, GPT-4, Claude 3, and Gemini, for automated Behaviour-Driven Development (BDD) scenario generation. To support this evaluation, we constructed a dataset of 500 user stories, requirement descriptions, and their corresponding BDD scenarios, drawn from four proprietary software products. We assessed the quality of BDD scenarios generated by LLMs using a multidimensional evaluation framework encompassing text and semantic similarity metrics, LLM-based evaluation, and human expert assessment. Our findings reveal that although GPT-4 achieves higher scores on text and semantic similarity metrics, Claude 3 produces the scenarios rated highest by both human experts and LLM-based evaluators. LLM-based evaluators, particularly DeepSeek, correlate more strongly with human judgment than text and semantic similarity metrics do. The effectiveness of prompting techniques is model-specific: GPT-4 performs best with zero-shot prompting, Claude 3 benefits from chain-of-thought reasoning, and Gemini achieves its best results with few-shot examples. Input quality determines the effectiveness of BDD scenario generation: detailed requirement descriptions alone yield high-quality scenarios, whereas user stories alone yield low-quality scenarios. Our experiments indicate that setting temperature to 0 and top_p to 1.0 produced the highest-quality BDD scenarios across all models.
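To make the reported decoding setup concrete, here is a minimal sketch of zero-shot BDD scenario generation at temperature 0 and top_p 1.0. The OpenAI Python client, the model identifier, and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: zero-shot BDD (Gherkin) scenario generation.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_bdd_scenario(requirement: str) -> str:
    """Generate a Given/When/Then scenario from a requirement description."""
    response = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # deterministic decoding, per the paper's finding
        top_p=1.0,
        messages=[
            {"role": "system",
             "content": "You are a QA engineer. Write a BDD scenario in "
                        "Gherkin (Given/When/Then) for the requirement."},
            {"role": "user", "content": requirement},
        ],
    )
    return response.choices[0].message.content

print(generate_bdd_scenario(
    "Registered users can reset their password via an emailed link."))
```

Both decoding parameters have counterparts in the Claude and Gemini APIs, so the same setting can be held fixed across all three models.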
Related papers
- No-Human in the Loop: Agentic Evaluation at Scale for Recommendation [11.764010898952677]
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting (a toy majority-vote sketch follows this entry).
arXiv Detail & Related papers (2025-11-04T22:49:39Z)
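As a toy illustration of the majority-voting aggregation mentioned in the ScalingEval entry above, the sketch below collapses multiple judge labels per item into a single ground-truth label; the vote layout and label names are assumptions for illustration.

```python
# Toy sketch: aggregate multiple LLM-judge labels into a single
# ground-truth label by majority vote, as in ScalingEval-style pipelines.
# The vote layout and label names are illustrative assumptions.
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Return the most common label; ties resolve to the first encountered."""
    return Counter(labels).most_common(1)[0][0]

judge_votes = {
    "item-1": ["relevant", "relevant", "not_relevant"],
    "item-2": ["not_relevant", "not_relevant", "relevant"],
}
ground_truth = {item: majority_vote(votes) for item, votes in judge_votes.items()}
print(ground_truth)  # {'item-1': 'relevant', 'item-2': 'not_relevant'}
```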
- Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison [0.0]
This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs). We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models.
arXiv Detail & Related papers (2025-05-30T15:24:17Z)
- Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability (a toy agreement computation follows this entry).
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
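As a hedged illustration of the inter-rater reliability analysis the entry above describes, this sketch computes Cohen's kappa between two raters' binary labels with scikit-learn; the label vectors are fabricated purely for illustration.

```python
# Toy sketch: inter-rater reliability via Cohen's kappa for binary labels.
# Uses scikit-learn; the label vectors are illustrative, not study data.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 0]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```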
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework. Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- BatchEval: Towards Human-like Text Evaluation [12.187982795098623]
BatchEval is a paradigm that conducts batch-wise evaluation iteratively to alleviate the above problems.
We show that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson correlations with only 64% API cost on average (a toy Pearson-correlation computation follows this entry).
arXiv Detail & Related papers (2023-12-31T09:34:51Z)
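Several entries here report evaluator quality as a Pearson correlation with human judgment, as does the main paper. Below is a minimal sketch of that computation with SciPy; the score vectors are illustrative values, not results from any paper above.

```python
# Toy sketch: Pearson correlation between automatic metric scores and
# human ratings, the agreement statistic BatchEval-style studies report.
# The score vectors are illustrative values only.
from scipy.stats import pearsonr

metric_scores = [0.72, 0.55, 0.91, 0.40, 0.66]
human_ratings = [4.0, 3.0, 5.0, 2.0, 3.5]

r, p_value = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```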
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [51.87391234815163]
E-commerce platforms require structured product data in the form of attribute-value pairs. BERT-based extraction methods require large amounts of task-specific training data. This paper explores using large language models (LLMs) as a more training-data-efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
UMSE is the first unified summarization evaluation framework able to operate in three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not adequately capture these dimensions (a minimal BLEU/ROUGE computation follows this entry).
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
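To ground the similarity-metric baseline discussed in the entry above, here is a minimal sketch computing ROUGE and BLEU between a candidate and a reference text, using the rouge-score and NLTK packages; the example texts are illustrative stand-ins.

```python
# Toy sketch: BLEU and ROUGE similarity between a candidate text and a
# reference, the surface-level baselines LLM-based evaluators are compared
# against. Uses the `rouge-score` and `nltk` packages; texts are illustrative.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Given a registered user, when they request a reset link, then an email is sent."
candidate = "Given a registered user, when a reset link is requested, an email is sent."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})

smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short texts
bleu = sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)
print(f"BLEU: {bleu:.3f}")
```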