Related papers: Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting

Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting

URL: http://arxiv.org/abs/2408.00161v2
Date: Thu, 8 Aug 2024 16:31:05 GMT
Title: Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting
Authors: Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto,
Abstract summary: This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques. We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.
Score: 6.938766764201549
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work in behavioral testing for natural language processing (NLP) models, such as Checklist, is inspired by related paradigms in software engineering testing. They allow evaluation of general linguistic capabilities and domain understanding, hence can help evaluate conceptual soundness and identify model weaknesses. However, a major challenge is the creation of test cases. The current packages rely on semi-automated approach using manual development which requires domain expertise and can be time consuming. This paper introduces an automated approach to develop test cases by exploiting the power of large language models and statistical techniques. It clusters the text representations to carefully construct meaningful groups and then apply prompting techniques to automatically generate Minimal Functionality Tests (MFT). The well-known Amazon Reviews corpus is used to demonstrate our approach. We analyze the behavioral test profiles across four different classification algorithms and discuss the limitations and strengths of those models.

Related papers

Software Testing with Large Language Models: An Interview Study with Practitioners [2.198430261120653]
The use of large language models in software testing is growing fast as they support numerous tasks.<n>However, their adoption often relies on informal experimentation rather than structured guidance.<n>This study investigates how software testing professionals use LLMs in practice to propose a preliminary, practitioner-informed guideline.
arXiv Detail & Related papers (2025-10-20T05:06:56Z)
AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models [11.958545255487735]
We introduce AutoTestForge, an automated and multidimensional testing framework for NLP models. Within AutoTestForge, through the utilization of Large Language Models (LLMs) to automatically generate test templates and instantiate them, manual involvement is significantly reduced. The framework also extends the test suite across three dimensions, taxonomy, fairness, and robustness, offering a comprehensive evaluation of the capabilities of NLP models.
arXiv Detail & Related papers (2025-03-07T02:44:17Z)
Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance. We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
Multi-language Unit Test Generation using LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We show how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning. In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach. Our proposed method outperforms all comparison methods on average considering both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
KAT: Dependency-aware Automated API Testing with Large Language Models [1.7264233311359707]
KAT (Katalon API Testing) is a novel AI-driven approach that autonomously generates test cases to validate APIs. Our evaluation of KAT using 12 real-world services shows that it can improve validation coverage, detect more undocumented status codes, and reduce false positives in these services.
arXiv Detail & Related papers (2024-07-14T14:48:18Z)
Automating REST API Postman Test Cases Using LLM [0.0]
This research paper is dedicated to the exploration and implementation of an automated approach to generate test cases using Large Language Models. The methodology integrates the use of Open AI to enhance the efficiency and effectiveness of test case generation. The model that is developed during the research is trained using manually collected postman test cases or instances for various Rest APIs.
arXiv Detail & Related papers (2024-04-16T15:53:41Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts) This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
FIND: A Function Description Benchmark for Evaluating Interpretability Methods [86.80718559904854]
This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
arXiv Detail & Related papers (2023-09-07T17:47:26Z)
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning" Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
arXiv Detail & Related papers (2021-07-28T18:09:46Z)
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets [17.13484629172643]
We develop a framework to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. We propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
arXiv Detail & Related papers (2021-06-16T18:20:58Z)
Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.