Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and
Toxicity
- URL: http://arxiv.org/abs/2301.12867v4
- Date: Mon, 29 May 2023 17:46:54 GMT
- Title: Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and
Toxicity
- Authors: Terry Yue Zhuo, Yujin Huang, Chunyang Chen and Zhenchang Xing
- Abstract summary: Large language models (LLMs) may exhibit social prejudice and toxicity, posing ethical and societal dangers of consequences resulting from irresponsibility.
We empirically benchmark ChatGPT on multiple sample datasets.
We find that a significant number of ethical risks cannot be addressed by existing benchmarks.
- Score: 19.94836502156002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent breakthroughs in natural language processing (NLP) have permitted the
synthesis and comprehension of coherent text in an open-ended way, therefore
translating the theoretical algorithms into practical applications. The large
language models (LLMs) have significantly impacted businesses such as report
summarization software and copywriters. Observations indicate, however, that
LLMs may exhibit social prejudice and toxicity, posing ethical and societal
dangers of consequences resulting from irresponsibility. Large-scale benchmarks
for accountable LLMs should consequently be developed. Although several
empirical investigations reveal the existence of a few ethical difficulties in
advanced LLMs, there is little systematic examination and user study of the
risks and harmful behaviors of current LLM usage. To further educate future
efforts on constructing ethical LLMs responsibly, we perform a qualitative
research method called ``red teaming'' on OpenAI's ChatGPT\footnote{In this
paper, ChatGPT refers to the version released on Dec 15th.} to better
understand the practical features of ethical dangers in recent LLMs. We analyze
ChatGPT comprehensively from four perspectives: 1) \textit{Bias} 2)
\textit{Reliability} 3) \textit{Robustness} 4) \textit{Toxicity}. In accordance
with our stated viewpoints, we empirically benchmark ChatGPT on multiple sample
datasets. We find that a significant number of ethical risks cannot be
addressed by existing benchmarks, and hence illustrate them via additional case
studies. In addition, we examine the implications of our findings on AI ethics
and harmal behaviors of ChatGPT, as well as future problems and practical
design considerations for responsible LLMs. We believe that our findings may
give light on future efforts to determine and mitigate the ethical hazards
posed by machines in LLM applications.
Related papers
- Potential and Perils of Large Language Models as Judges of Unstructured Textual Data [0.631976908971572]
This research investigates the effectiveness of LLM-as-judge models to evaluate the thematic alignment of summaries generated by other LLMs.
Our findings reveal that while LLM-as-judge offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances.
arXiv Detail & Related papers (2025-01-14T14:49:14Z) - Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks.
This vulnerability poses significant risks to the real-world applications.
We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs)
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Eagle: Ethical Dataset Given from Real Interactions [74.7319697510621]
We create datasets extracted from real interactions between ChatGPT and users that exhibit social biases, toxicity, and immoral problems.
Our experiments show that Eagle captures complementary aspects, not covered by existing datasets proposed for evaluation and mitigation of such ethical challenges.
arXiv Detail & Related papers (2024-02-22T03:46:02Z) - Breaking the Silence: the Threats of Using LLMs in Software Engineering [12.368546216271382]
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community.
This paper initiates an open discussion on potential threats to the validity of LLM-based research.
arXiv Detail & Related papers (2023-12-13T11:02:19Z) - FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity [20.510512358961517]
The widespread of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts.
Previous researchers have invested much effort in assessing the harmlessness of generative language models.
arXiv Detail & Related papers (2023-11-30T14:18:47Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z) - Assessing Hidden Risks of LLMs: An Empirical Study on Robustness,
Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs) including ChatGPT, LLaMA, and OPT.
We find that ChatGPT is still capable to yield the correct answer even when the input is polluted at an extreme level.
We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z) - Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [29.433401785920065]
Large language models (LLMs) can generate causal arguments with high probability.
LLMs may be used by human domain experts to save effort in setting up a causal analysis.
arXiv Detail & Related papers (2023-04-28T19:00:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.