Checking HateCheck: a cross-functional analysis of behaviour-aware
learning for hate speech detection
- URL: http://arxiv.org/abs/2204.04042v1
- Date: Fri, 8 Apr 2022 13:03:01 GMT
- Title: Checking HateCheck: a cross-functional analysis of behaviour-aware
learning for hate speech detection
- Authors: Pedro Henrique Luz de Araujo and Benjamin Roth
- Abstract summary: We investigate fine-tuning schemes using HateCheck, a suite of functional tests for hate speech detection systems.
We train and evaluate models on different configurations of HateCheck by holding out categories of test cases.
The fine-tuning procedure led to improvements in the classification accuracy of held-out functionalities and identity groups.
However, performance on held-out functionality classes and i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class.
- Score: 4.0810783261728565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Behavioural testing -- verifying system capabilities by validating
human-designed input-output pairs -- is an alternative evaluation method for
natural language processing systems, proposed to address the shortcomings of
the standard approach: computing metrics on held-out data. While behavioural
tests capture human prior knowledge and insights, there has been little
exploration of how to leverage them for model training and development. With
this in mind,
we explore behaviour-aware learning by examining several fine-tuning schemes
using HateCheck, a suite of functional tests for hate speech detection systems.
To address potential pitfalls of training on data originally intended for
evaluation, we train and evaluate models on different configurations of
HateCheck by holding out categories of test cases, which enables us to estimate
performance on potentially overlooked system properties. The fine-tuning
procedure led to improvements in the classification accuracy of held-out
functionalities and identity groups, suggesting that models can potentially
generalise to overlooked functionalities. However, performance on held-out
functionality classes and i.i.d. hate speech detection data decreased, which
indicates that generalisation occurs mostly across functionalities from the
same class and that the procedure led to overfitting to the HateCheck data
distribution.
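A minimal sketch of the held-out-functionality protocol described above, assuming a simplified HateCheck-style table of (functionality, text, label) cases; TF-IDF plus logistic regression stands in for transformer fine-tuning, and the example rows are illustrative, not real HateCheck data.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# illustrative HateCheck-style cases: (functionality, text, label)
cases = [
    ("derogation", "I despise all [IDENTITY]", 1),
    ("derogation", "[IDENTITY] are the worst", 1),
    ("threat", "I will hurt every [IDENTITY]", 1),
    ("threat", "[IDENTITY] should be attacked", 1),
    ("negation", "No [IDENTITY] deserves hate", 0),
    ("negation", "I would never hurt [IDENTITY]", 0),
    ("counter_speech", "Saying 'I hate [IDENTITY]' is never okay", 0),
    ("counter_speech", "We must reject posts like 'attack them'", 0),
]

held_out = {"counter_speech"}  # functionality withheld from training
train = [(t, y) for f, t, y in cases if f not in held_out]
test = [(t, y) for f, t, y in cases if f in held_out]

# TF-IDF + logistic regression as a stand-in for transformer fine-tuning
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit([t for t, _ in train], [y for _, y in train])
preds = model.predict([t for t, _ in test])
print("held-out accuracy:", accuracy_score([y for _, y in test], preds))
```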
Related papers
- An Auditing Test To Detect Behavioral Shift in Language Models [28.52295230939529]
We present a method for continual Behavioral Shift Auditing (BSA) in language models.
BSA detects behavioral shifts solely through model generations.
We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
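A rough sketch of the idea, assuming a scalar behaviour score per generation (the scorer below is a hypothetical stand-in for, e.g., a toxicity classifier); the paper's sequential auditing test is approximated here by a plain two-sample Kolmogorov-Smirnov test.
```python
from scipy.stats import ks_2samp

def behaviour_score(text: str) -> float:
    # hypothetical scorer; in practice e.g. a toxicity classifier
    return min(1.0, text.lower().count("hate") / 3)

# toy generations; real audits would use hundreds per model version
baseline_generations = ["I enjoy hiking", "Cats are great", "Nice day out"]
current_generations = ["I hate this", "hate hate hate", "Nice day out"]

baseline = [behaviour_score(g) for g in baseline_generations]
current = [behaviour_score(g) for g in current_generations]
stat, p = ks_2samp(baseline, current)
print(f"KS={stat:.2f}, p={p:.3f}",
      "-> shift detected" if p < 0.05 else "-> inconclusive (tiny sample)")
```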
arXiv Detail & Related papers (2024-10-25T09:09:31Z)
- Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
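A crude sketch of the underlying attribution idea: swap each shifted feature back to its reference values and measure the accuracy change. Single-feature swaps stand in for the paper's Shapley-value and Optimal Transport machinery; the data is synthetic.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 3))
y = (X_ref[:, 0] + X_ref[:, 1] > 0).astype(int)
X_shift = X_ref.copy()
X_shift[:, 1] += 2.0            # feature 1 drifts at deployment time

model = LogisticRegression().fit(X_ref, y)
base = model.score(X_shift, y)  # degraded accuracy under the shift
for j in range(X_ref.shape[1]):
    X_swap = X_shift.copy()
    X_swap[:, j] = X_ref[:, j]  # undo the shift of feature j only
    print(f"feature {j}: restoring it changes accuracy by "
          f"{model.score(X_swap, y) - base:+.3f}")
```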
arXiv Detail & Related papers (2024-08-24T18:28:19Z)
- Bisimulation Learning [55.859538562698496]
We compute finite bisimulations of state transition systems with large, possibly infinite state spaces.
Our technique yields faster verification results than alternative state-of-the-art tools in practice.
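For intuition, a classic partition-refinement computation of bisimulation classes on a small finite system; the paper itself learns bisimulations in order to scale to large or infinite state spaces.
```python
def bisimulation_classes(states, succ, label):
    # start from the partition induced by observable labels
    blocks = {}
    for s in states:
        blocks.setdefault(label[s], set()).add(s)
    partition = list(blocks.values())
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # split states whose successors hit different blocks
            sig = {}
            for s in block:
                key = frozenset(
                    i for i, b in enumerate(partition) if succ[s] & b
                )
                sig.setdefault(key, set()).add(s)
            new_partition.extend(sig.values())
            if len(sig) > 1:
                changed = True
        partition = new_partition
    return partition

# toy 4-state system: states 0 and 1 behave alike, 2 and 3 are sinks
states = [0, 1, 2, 3]
succ = {0: {2}, 1: {3}, 2: set(), 3: set()}
label = {0: "a", 1: "a", 2: "b", 3: "b"}
print(bisimulation_classes(states, succ, label))  # [{0, 1}, {2, 3}]
```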
arXiv Detail & Related papers (2024-05-24T17:11:27Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
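A minimal sketch of prompt-based test generation, assuming a hypothetical `llm_complete` wrapper around whichever model API is available; the functionality descriptions are illustrative, not the paper's taxonomy.
```python
FUNCTIONALITIES = {  # illustrative descriptions
    "negated_hate": "a sentence that negates a hateful statement and is "
                    "therefore NOT hateful",
    "counter_speech": "a message that quotes hate speech in order to "
                      "condemn it, and is therefore NOT hateful",
}

def llm_complete(prompt: str) -> str:
    # stub; replace with a call to an actual LLM API
    return "<model output for: " + prompt[:50] + "...>"

def generate_cases(functionality: str, target_group: str, n: int = 5) -> str:
    prompt = (
        f"Write {n} realistic social-media posts, one per line, each being "
        f"{FUNCTIONALITIES[functionality]}, mentioning '{target_group}'."
    )
    return llm_complete(prompt)

print(generate_cases("counter_speech", "immigrants"))
# generated cases would then be filtered and validated by crowd annotators
```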
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Probabilistic Safety Regions Via Finite Families of Scalable Classifiers [2.431537995108158]
Supervised classification recognizes patterns in the data to separate classes of behaviours.
Canonical solutions contain misclassification errors that are intrinsic to the approximate, numerical nature of machine learning.
We introduce the concept of probabilistic safety region to describe a subset of the input space in which the number of misclassified instances is probabilistically controlled.
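A simple stand-in for the construction: calibrate a confidence threshold so that the error rate among accepted calibration points stays below a target. The paper's scalable-classifier machinery and formal probabilistic guarantees are not reproduced here.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5,
                                            random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

conf = model.predict_proba(X_cal).max(axis=1)   # model confidence
err = model.predict(X_cal) != y_cal             # calibration errors
target = 0.05                                   # tolerated error rate
# smallest threshold whose accepted subset keeps error below target
for t in np.sort(conf):
    mask = conf >= t
    if mask.any() and err[mask].mean() <= target:
        print(f"threshold {t:.3f}: accepts {mask.mean():.0%} of inputs, "
              f"error {err[mask].mean():.1%}")
        break
```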
arXiv Detail & Related papers (2023-09-08T22:40:19Z)
- Cross-functional Analysis of Generalisation in Behavioural Learning [4.0810783261728565]
We introduce BeLUGA, an analysis method for evaluating behavioural learning that considers generalisation across dimensions of different granularity levels.
An aggregate score measures generalisation to unseen functionalities (or overfitting).
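A toy sketch of one possible aggregate score, assuming per-functionality accuracies from seen and held-out configurations (numbers illustrative; BeLUGA's actual aggregation may differ):
```python
# per-functionality accuracy, measured on seen vs. held-out setups
seen = {"derogation": 0.95, "threat": 0.93, "negation": 0.90}
held_out = {"derogation": 0.88, "threat": 0.85, "negation": 0.71}

gaps = [seen[f] - held_out[f] for f in seen]
aggregate = 1 - sum(gaps) / len(gaps)  # 1 = perfect generalisation
print(f"aggregate generalisation score: {aggregate:.3f}")
```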
arXiv Detail & Related papers (2023-05-22T11:54:19Z)
- Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive a 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
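A small sketch of the pipeline, assuming raw detection scores per classifier: compute pairwise rank correlations, convert them to distances, and embed in 2D with MDS (the scores below are random placeholders for real system outputs).
```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 200))   # 5 classifiers x 200 shared trials
n = len(scores)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        tau, _ = kendalltau(scores[i], scores[j])
        dist[i, j] = dist[j, i] = 1 - tau   # rank correlation -> distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords)  # one 2D point per classifier, ready to plot
```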
arXiv Detail & Related papers (2021-06-11T13:03:33Z)
- HateCheck: Functional Tests for Hate Speech Detection Models [3.4938484663205776]
We introduce HateCheck, a first suite of functional tests for hate speech detection models.
We specify 29 model functionalities, the selection of which we motivate by reviewing previous research.
We test near-state-of-the-art transformer detection models as well as a popular commercial model, revealing critical model weaknesses.
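A minimal sketch of functional testing, assuming a hypothetical `predict` interface and a toy subset of functionalities rather than the real 29:
```python
def predict(text: str) -> int:
    # stub model: replace with a real hate speech classifier
    return int("hate" in text.lower())

cases = {  # illustrative (text, label) cases per functionality
    "derogation": [("I hate [IDENTITY]", 1), ("[IDENTITY] disgust me", 1)],
    "negated_hate": [("I don't hate [IDENTITY]", 0)],
    "neutral_identity": [("Many [IDENTITY] live here", 0)],
}

for functionality, examples in cases.items():
    acc = sum(predict(t) == y for t, y in examples) / len(examples)
    flag = "  <-- weakness" if acc < 0.5 else ""
    print(f"{functionality}: {acc:.0%}{flag}")
```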
arXiv Detail & Related papers (2020-12-31T13:44:56Z)
- Understanding Failures of Deep Networks via Robust Feature Extraction [44.204907883776045]
We introduce and study a method aimed at characterizing and explaining failures by identifying visual attributes whose presence or absence results in poor performance.
We leverage the representation of a separate robust model to extract interpretable features and then harness these features to identify failure modes.
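A rough sketch of the failure-mining recipe, with a random projection standing in for a robust model's feature extractor and synthetic data throughout:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
inputs = rng.normal(size=(300, 64))    # stand-in raw inputs
wrong = rng.random(300) < 0.2          # stand-in misclassification mask
W = rng.normal(size=(64, 16))          # random projection as placeholder
robust_features = inputs[wrong] @ W    # "robust model" embedding of failures

# cluster failures so each cluster suggests a coherent failure mode
clusters = KMeans(n_clusters=3, n_init=10,
                  random_state=0).fit_predict(robust_features)
for c in range(3):
    print(f"candidate failure mode {c}: {np.sum(clusters == c)} examples")
```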
arXiv Detail & Related papers (2020-12-03T08:33:29Z)