Test Oracle Automation in the era of LLMs
- URL: http://arxiv.org/abs/2405.12766v1
- Date: Tue, 21 May 2024 13:19:10 GMT
- Title: Test Oracle Automation in the era of LLMs
- Authors: Facundo Molina, Alessandra Gorla,
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable proficiency in tackling diverse software testing tasks.
This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles.
- Score: 52.69509240442899
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The effectiveness of a test suite in detecting faults highly depends on the correctness and completeness of its test oracles. Large Language Models (LLMs) have already demonstrated remarkable proficiency in tackling diverse software testing tasks, such as automated test generation and program repair. This paper aims to enable discussions on the potential of using LLMs for test oracle automation, along with the challenges that may emerge during the generation of various types of oracles. Additionally, our aim is to initiate discussions on the primary threats that SE researchers must consider when employing LLMs for oracle automation, encompassing concerns regarding oracle deficiencies and data leakages.
Related papers
- ASTRAL: Automated Safety Testing of Large Language Models [6.1050306667733185]
Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content.
We present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs.
arXiv Detail & Related papers (2025-01-28T18:25:11Z) - Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time consuming and costly, leading to an increased demand for automated methods.
Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering.
This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv Detail & Related papers (2024-12-31T02:06:46Z) - SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - PentestAgent: Incorporating LLM Agents to Automated Penetration Testing [6.815381197173165]
Manual penetration testing is time-consuming and expensive.
Recent advancements in large language models (LLMs) offer new opportunities for enhancing penetration testing.
We propose PentestAgent, a novel LLM-based automated penetration testing framework.
arXiv Detail & Related papers (2024-11-07T21:10:39Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone.
We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench.
We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z) - TOGLL: Correct and Strong Test Oracle Generation with LLMs [0.8057006406834466]
Test oracles play a crucial role in software testing, enabling effective bug detection.
Despite initial promise, neural-based methods for automated test oracle generation often result in a large number of false positives.
We present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles.
arXiv Detail & Related papers (2024-05-06T18:37:35Z) - LangBiTe: A Platform for Testing Bias in Large Language Models [1.9744907811058787]
Large Language Models (LLMs) are trained on a vast amount of data scrapped from forums, websites, social media and other internet sources.
LangBiTe enables development teams to tailor their test scenarios, and automatically generate and execute the test cases according to a set of user-defined ethical requirements.
LangBite provides users with the bias evaluation of LLMs, and end-to-end traceability between the initial ethical requirements and the insights obtained.
arXiv Detail & Related papers (2024-04-29T10:02:45Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.