DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models
- URL: http://arxiv.org/abs/2306.11698v5
- Date: Mon, 26 Feb 2024 20:41:01 GMT
- Title: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models
- Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang,
Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T.
Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng,
Sanmi Koyejo, Dawn Song, Bo Li
- Abstract summary: This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
- Score: 92.6951708781736
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting
progress in their capabilities, capturing the interest of practitioners and the
public alike. Yet, while the literature on the trustworthiness of GPT models
remains limited, practitioners have proposed employing capable GPT models for
sensitive applications such as healthcare and finance -- where mistakes can be
costly. To this end, this work proposes a comprehensive trustworthiness
evaluation for large language models with a focus on GPT-4 and GPT-3.5,
considering diverse perspectives -- including toxicity, stereotype bias,
adversarial robustness, out-of-distribution robustness, robustness on
adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
evaluations, we discover previously unpublished vulnerabilities to
trustworthiness threats. For instance, we find that GPT models can be easily
misled to generate toxic and biased outputs and leak private information in
both training data and conversation history. We also find that although GPT-4
is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
vulnerable given jailbreaking system or user prompts, potentially because GPT-4
follows (misleading) instructions more precisely. Our work illustrates a
comprehensive trustworthiness evaluation of GPT models and sheds light on the
trustworthiness gaps. Our benchmark is publicly available at
https://decodingtrust.github.io/ ; our dataset can be previewed at
https://huggingface.co/datasets/AI-Secure/DecodingTrust ; a concise version of
this work is at https://openreview.net/pdf?id=kaHpo8OZw2 .
Related papers
- Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs [7.687215328455751]
We present a framework for the automated evaluation of Custom GPTs against OpenAI's usage policies.
We evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs.
The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes.
arXiv Detail & Related papers (2025-02-03T15:19:28Z) - Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning [6.537257913467247]
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD)
Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning.
Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios.
arXiv Detail & Related papers (2025-01-15T13:46:33Z) - Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models [27.675558033502565]
We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection.
For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of.
For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
arXiv Detail & Related papers (2024-07-12T03:33:13Z) - Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z) - Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models [20.92843974858305]
GPT models are increasingly being used for task optimization.
In this paper, we introduce a straightforward yet potent Conversation Reconstruction Attack.
We present two advanced attacks targeting improved reconstruction of past conversations.
arXiv Detail & Related papers (2024-02-05T13:18:42Z) - Llamas Know What GPTs Don't Show: Surrogate Models for Confidence
Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user.
As of November 2023, state-of-the-art LLMs do not provide access to these probabilities.
Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.