Related papers: Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

URL: http://arxiv.org/abs/2005.04118v1
Date: Fri, 8 May 2020 15:48:31 GMT
Title: Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Authors: Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Abstract summary: CheckList is a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation. In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Score: 66.42971817954806
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

Related papers

AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models [11.958545255487735]
We introduce AutoTestForge, an automated and multidimensional testing framework for NLP models. Within AutoTestForge, through the utilization of Large Language Models (LLMs) to automatically generate test templates and instantiate them, manual involvement is significantly reduced. The framework also extends the test suite across three dimensions, taxonomy, fairness, and robustness, offering a comprehensive evaluation of the capabilities of NLP models.
arXiv Detail & Related papers (2025-03-07T02:44:17Z)
Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models. In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the taskspecific models consistently exhibit. We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z)
Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data. We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch. Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks [51.14185612418977]
A strategy to improve label quality is to ask multiple annotators to label the same item and aggregate their labels. While a variety of bespoke models have been proposed for specific tasks, our work is the first to introduce aggregation methods that generalize across many diverse complex tasks. This article extends our prior work with investigation of three new research questions.
arXiv Detail & Related papers (2023-12-20T21:28:35Z)
Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems. We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework. Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either can be qualified as simple (e.g. unit tests) or that require precise specifications. Most testing procedures still rely on test cases written by humans to form test suites. We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs. This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set nor labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z)
White-box Testing of NLP models with Mask Neuron Coverage [30.508750085817717]
We propose a set of white-box testing methods that are customized for transformer-based NLP models. MNCOVER measures how thoroughly the attention layers in models are exercised during testing. We show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
arXiv Detail & Related papers (2022-05-10T17:07:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.