Test-driven Software Experimentation with LASSO: an LLM Benchmarking Example
- URL: http://arxiv.org/abs/2410.08911v1
- Date: Fri, 11 Oct 2024 15:32:48 GMT
- Title: Test-driven Software Experimentation with LASSO: an LLM Benchmarking Example
- Authors: Marcus Kessel
- Abstract summary: Test-Driven Software Experiments (TDSEs) are experiments that involve the execution of software subjects and the observation and analysis of their "de facto" run-time behavior.
We present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs.
- Score: 1.4685355149711299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical software engineering faces a critical gap: the lack of standardized tools for rapid development and execution of Test-Driven Software Experiments (TDSEs) - that is, experiments that involve the execution of software subjects and the observation and analysis of their "de facto" run-time behavior. In this paper we present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs. By empowering users with an executable scripting language to design and execute TDSEs, LASSO enables efficient evaluation of run-time semantics and execution characteristics in addition to statically determined properties. We present an example TDSE that demonstrates the practical benefits of LASSO's scripting capabilities for assessing the reliability of LLMs for code generation by means of a self-contained, reusable and extensible study script. The LASSO platform is freely available at: https://softwareobservatorium.github.io/, and a demo video is available on YouTube: https://youtu.be/tzY9oNTWXzw
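To make the idea of a TDSE concrete, the sketch below shows, in plain Python, the kind of experiment the abstract describes: several candidate implementations (standing in for LLM-generated code) are executed against a shared test suite, and their observed run-time behavior (pass/fail outcomes, raised exceptions, execution time) is recorded for analysis. This is only an illustrative sketch with invented names; it is not LASSO's scripting language, data structures, or API.
```python
# Illustrative sketch of a test-driven software experiment (TDSE): execute
# candidate implementations (standing in for LLM-generated code) against a
# shared test suite and record their observed run-time behavior.
# All names here are made up for illustration; this is not LASSO's API.

import time

# Candidate implementations of the same functionality, e.g. produced by
# different LLMs. Hard-coded strings stand in for generated code.
CANDIDATES = {
    "candidate_a": "def base64_len(n):\n    return (n + 2) // 3 * 4",
    "candidate_b": "def base64_len(n):\n    return n * 4 // 3",  # subtly wrong
}

# Stimuli (input sizes in bytes) and expected responses shared by all candidates.
TESTS = [(0, 0), (1, 4), (2, 4), (3, 4), (4, 8)]

def observe(source: str) -> list[dict]:
    """Run one candidate against the test suite and record observations."""
    namespace: dict = {}
    exec(source, namespace)            # materialize the generated function
    fn = namespace["base64_len"]
    records = []
    for stimulus, expected in TESTS:
        start = time.perf_counter()
        try:
            actual = fn(stimulus)
            outcome = "pass" if actual == expected else "fail"
        except Exception as exc:       # run-time faults are observations too
            actual, outcome = repr(exc), "error"
        records.append({"input": stimulus, "expected": expected,
                        "actual": actual, "outcome": outcome,
                        "seconds": time.perf_counter() - start})
    return records

if __name__ == "__main__":
    for name, source in CANDIDATES.items():
        results = observe(source)
        passed = sum(r["outcome"] == "pass" for r in results)
        print(f"{name}: {passed}/{len(results)} tests passed")
```
According to the abstract, in LASSO itself such a study would instead be written as a self-contained, reusable and extensible study script in the platform's executable scripting language.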
Related papers
- LELANTE: LEveraging LLM for Automated ANdroid TEsting [6.112769800569302]
Existing testing approaches require developers to manually write scripts using tools such as Appium and Espresso to execute test cases.
We introduce LELANTE, a novel framework that utilizes large language models (LLMs) to automate test case execution without requiring pre-written scripts.
In experiments across 390 test cases spanning 10 popular Android applications, LELANTE achieved a 73% test execution success rate.
arXiv Detail & Related papers (2025-04-29T16:13:49Z) - Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs [63.10710876536337]
We propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts.
Our framework comprises two components: (1) task creation, using top-down functionality and bottom-up API synergy exploration to generate helpful tasks.
Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs.
arXiv Detail & Related papers (2025-04-29T04:03:37Z) - Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation [11.037212298533069]
Large Language Models (LLMs) have opened up new opportunities to generate tests automatically.
This paper studies automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation.
Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage.
arXiv Detail & Related papers (2025-01-17T13:48:32Z) - Learning to Ask: When LLMs Meet Unclear Instruction [49.256630152684764]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.
We evaluate LLMs' tool-use performance under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.
We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z) - On the Evaluation of Large Language Models in Unit Test Generation [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z) - TESTEVAL: Benchmarking Large Language Models for Test Case Generation [15.343859279282848]
We propose TESTEVAL, a novel benchmark for test case generation with large language models (LLMs).
We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage.
We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs.
arXiv Detail & Related papers (2024-06-06T22:07:50Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking them at the sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - User Centric Evaluation of Code Generation Tools [2.8115477071897788]
This paper proposes a user-centric method for evaluating the usability of large language models (LLMs).
It includes metadata in the benchmark's test cases to describe their intended usage, conducts testing in a multi-attempt process that mimics how LLMs are used in practice, measures LLM-generated solutions on a set of quality attributes that reflect usability, and evaluates performance based on users' experience of LLMs as a tool.
arXiv Detail & Related papers (2024-02-05T15:56:19Z) - Are We Testing or Being Tested? Exploring the Practical Applications of Large Language Models in Software Testing [0.0]
Large Language Models (LLMs) are cutting-edge artificial intelligence models that generate coherent content.
LLMs can play a pivotal role in software development, including software testing.
This study explores the practical application of LLMs in software testing within an industrial setting.
arXiv Detail & Related papers (2023-12-08T06:30:37Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities [8.504639288314063]
Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks.
Existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test scripts across diverse devices, platforms, and applications.
This paper investigates the application of large language models (LLMs) in the domain of mobile application test script generation.
arXiv Detail & Related papers (2023-09-24T07:58:57Z) - MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
On the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z) - Instruction Tuning for Large Language Models: A Survey [52.86322823501338]
We make a systematic review of the literature, including the general methodology of supervised fine-tuning (SFT).
We also review the potential pitfalls of SFT and criticism against it, as well as efforts pointing out current deficiencies of existing strategies.
arXiv Detail & Related papers (2023-08-21T15:35:16Z) - How Effective are Large Language Models in Generating Software Specifications? [14.170320751508502]
Large Language Models (LLMs) have been successfully applied to numerous Software Engineering (SE) tasks.
We conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation.
arXiv Detail & Related papers (2023-06-06T00:28:39Z)
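The last entry above studies how well LLMs can turn comments or documentation into software specifications. As a rough, hypothetical illustration of how a generated specification can be checked (not the cited paper's actual methodology), the sketch below treats a specification as an executable postcondition and evaluates it against a reference implementation on sample inputs; the docstring, the spec string, and the helper names are all invented for illustration.
```python
# Hypothetical illustration of checking an LLM-"generated" specification:
# the specification is treated as an executable postcondition and evaluated
# against a reference implementation on sample inputs. The spec string below
# merely stands in for model output; nothing here reproduces the cited study.

DOCSTRING = "Return the absolute value of x."

# Pretend this postcondition was produced by an LLM prompted with the docstring.
GENERATED_SPEC = "result >= 0 and (result == x or result == -x)"

def reference_abs(x: int) -> int:
    """Reference implementation the generated specification is checked against."""
    return x if x >= 0 else -x

def spec_holds(spec: str, x: int, result: int) -> bool:
    """Evaluate the postcondition expression over an observed input/output pair."""
    return bool(eval(spec, {"__builtins__": {}}, {"x": x, "result": result}))

if __name__ == "__main__":
    samples = [-5, -1, 0, 1, 42]
    violations = [x for x in samples
                  if not spec_holds(GENERATED_SPEC, x, reference_abs(x))]
    print("docstring:", DOCSTRING)
    print("violations:", violations if violations else "none")
```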
This list is automatically generated from the titles and abstracts of the papers in this site.