I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box
Generative Language Models
- URL: http://arxiv.org/abs/2306.03423v2
- Date: Wed, 14 Jun 2023 05:13:34 GMT
- Title: I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box
Generative Language Models
- Authors: Max Reuter, William Schulze
- Abstract summary: We characterize ChatGPT's refusal behavior using a black-box attack.
We map several different kinds of responses to a binary of compliance or refusal.
We train a prompt classifier to predict whether ChatGPT will refuse a question, without seeing ChatGPT's response.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the release of OpenAI's ChatGPT, generative language models have
attracted extensive public attention. The increased usage has highlighted
generative models' broad utility, but also revealed several forms of embedded
bias. Some is induced by the pre-training corpus, but additional bias specific
to generative models arises from the use of subjective fine-tuning to avoid
generating harmful content. Fine-tuning bias may come from individual engineers
and company policies, and affects which prompts the model chooses to refuse. In
this experiment, we characterize ChatGPT's refusal behavior using a black-box
attack. We first query ChatGPT with a variety of offensive and benign prompts
(n=1,706), then manually label each response as compliance or refusal. Manual
examination of responses reveals that refusal is not cleanly binary, and lies
on a continuum; as such, we map several different kinds of responses to a
binary of compliance or refusal. The small manually-labeled dataset is used to
train a refusal classifier, which achieves an accuracy of 96%. Second, we use
this refusal classifier to bootstrap a larger (n=10,000) dataset adapted from
the Quora Insincere Questions dataset. With this machine-labeled data, we train
a prompt classifier to predict whether ChatGPT will refuse a given question,
without seeing ChatGPT's response. This prompt classifier achieves 76% accuracy
on a test set of manually labeled questions (n=985). We examine our classifiers
and the prompt n-grams that are most predictive of either compliance or
refusal. Our datasets and code are available at
https://github.com/maxwellreuter/chatgpt-refusals.
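
The abstract does not specify the classifier architectures, so the snippet below is only a minimal sketch of the second stage: a prompt classifier that scores refusal likelihood without ever seeing ChatGPT's response, built here with TF-IDF n-gram features and logistic regression, with inspection of the n-grams that most strongly predict compliance or refusal. The prompts and labels are invented placeholders standing in for the machine-labeled bootstrapped data; this is not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch (not the authors' released code): predict refusal from the
# prompt text alone using TF-IDF n-gram features and logistic regression.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder examples; in the paper, labels for ~10,000 prompts come from the
# bootstrapping step (the 96%-accurate refusal classifier labels ChatGPT's
# responses, and those machine labels supervise the prompt classifier).
prompts = [
    "How do I bake sourdough bread?",
    "Write step-by-step instructions for picking a lock.",
    "Summarize the plot of Hamlet.",
    "Tell me how to make a weapon at home.",
]
labels = [0, 1, 0, 1]  # 0 = ChatGPT complied, 1 = ChatGPT refused

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigram + bigram features
X = vectorizer.fit_transform(prompts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Predict refusal probability for an unseen prompt, without querying ChatGPT.
new_prompt = ["Explain how to hot-wire a car."]
print("P(refusal):", clf.predict_proba(vectorizer.transform(new_prompt))[0, 1])

# Large positive weights mark n-grams most predictive of refusal;
# large negative weights mark n-grams most predictive of compliance.
terms = vectorizer.get_feature_names_out()
order = np.argsort(clf.coef_[0])
print("refusal-predictive n-grams:   ", [terms[i] for i in order[-5:][::-1]])
print("compliance-predictive n-grams:", [terms[i] for i in order[:5]])
```

With real training data, the same weight inspection is one straightforward way to surface the kind of compliance- and refusal-predictive n-grams the abstract mentions examining.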
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z) - "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models [40.867655189493924]
The open-ended nature of language generation makes evaluating large language models (LLMs) challenging.
One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space.
We evaluate how aligned first-token evaluation is with the text output along several dimensions.
arXiv Detail & Related papers (2024-02-22T12:47:33Z) - Employing Label Models on ChatGPT Answers Improves Legal Text Entailment
Performance [5.484345596034158]
ChatGPT is robust in many natural language processing tasks, including legal text entailment.
We use label models to integrate the provisional answers by ChatGPT into consolidated labels.
The experimental results demonstrate that this approach can attain an accuracy of 76.15%, marking a significant improvement of 8.26% over the prior state-of-the-art benchmark.
arXiv Detail & Related papers (2024-01-31T15:04:01Z) - Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: its tendency to select labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI
Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - Realistic Conversational Question Answering with Answer Selection based
on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph and the question-answer pairs from previous turns of the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z) - Combing for Credentials: Active Pattern Extraction from Smart Reply [15.097010165958027]
We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline.
We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data.
We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data, even in realistic settings.
arXiv Detail & Related papers (2022-07-14T05:03:56Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs).
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based
Bias in NLP [10.936043362876651]
We propose a decoding algorithm that reduces the probability of a model producing problematic text.
While our approach does by no means eliminate the issue of language models generating biased text, we believe it to be an important step in this direction.
arXiv Detail & Related papers (2021-02-28T11:07:37Z) - Hate Speech Detection and Racial Bias Mitigation in Social Media based
on BERT model [1.9336815376402716]
We introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called BERT.
We evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on Twitter.
arXiv Detail & Related papers (2020-08-14T16:47:25Z)