DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models
- URL: http://arxiv.org/abs/2306.11698v5
- Date: Mon, 26 Feb 2024 20:41:01 GMT
- Title: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models
- Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang,
Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T.
Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng,
Sanmi Koyejo, Dawn Song, Bo Li
- Abstract summary: This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
- Score: 92.6951708781736
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting
progress in their capabilities, capturing the interest of practitioners and the
public alike. Yet, while the literature on the trustworthiness of GPT models
remains limited, practitioners have proposed employing capable GPT models for
sensitive applications such as healthcare and finance -- where mistakes can be
costly. To this end, this work proposes a comprehensive trustworthiness
evaluation for large language models with a focus on GPT-4 and GPT-3.5,
considering diverse perspectives -- including toxicity, stereotype bias,
adversarial robustness, out-of-distribution robustness, robustness on
adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
evaluations, we discover previously unpublished vulnerabilities to
trustworthiness threats. For instance, we find that GPT models can be easily
misled to generate toxic and biased outputs and leak private information in
both training data and conversation history. We also find that although GPT-4
is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
vulnerable given jailbreaking system or user prompts, potentially because GPT-4
follows (misleading) instructions more precisely. Our work illustrates a
comprehensive trustworthiness evaluation of GPT models and sheds light on the
trustworthiness gaps. Our benchmark is publicly available at
https://decodingtrust.github.io/ ; our dataset can be previewed at
https://huggingface.co/datasets/AI-Secure/DecodingTrust ; a concise version of
this work is at https://openreview.net/pdf?id=kaHpo8OZw2 .
Related papers
- Granting GPT-4 License and Opportunity: Enhancing Accuracy and Confidence Estimation for Few-Shot Event Detection [6.718542027371254]
Large Language Models (LLMs) have shown enough promise in few-shot learning context to suggest use in the generation of "silver" data.
Confidence estimation is a documented weakness of models such as GPT-4.
The present effort explores methods for effective confidence estimation with GPT-4 with few-shot learning for event detection in the BETTER License as a vehicle.
arXiv Detail & Related papers (2024-08-01T21:08:07Z) - Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models [27.675558033502565]
We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection.
For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of.
For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
arXiv Detail & Related papers (2024-07-12T03:33:13Z) - Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z) - Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models [20.92843974858305]
GPT models are increasingly being used for task optimization.
In this paper, we introduce a straightforward yet potent Conversation Reconstruction Attack.
We present two advanced attacks targeting improved reconstruction of past conversations.
arXiv Detail & Related papers (2024-02-05T13:18:42Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence
Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not provide access to these probabilities.
Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - A negation detection assessment of GPTs: analysis with the xNot360
dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.