Related papers: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

URL: http://arxiv.org/abs/2306.11698v5
Date: Mon, 26 Feb 2024 20:41:01 GMT
Title: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Authors: Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
Abstract summary: This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5. We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
Score: 92.6951708781736
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in their capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives -- including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially because GPT-4 follows (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/ ; our dataset can be previewed at https://huggingface.co/datasets/AI-Secure/DecodingTrust ; a concise version of this work is at https://openreview.net/pdf?id=kaHpo8OZw2 .

Related papers

Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs [7.687215328455751]
We present a framework for the automated evaluation of Custom GPTs against OpenAI's usage policies. We evaluate it through a large-scale study with 782 Custom GPTs across three categories: Romantic, Cybersecurity, and Academic GPTs. The results reveal that 58.7% of the analyzed models exhibit indications of non-compliance, exposing weaknesses in the GPT store's review and approval processes.
arXiv Detail & Related papers (2025-02-03T15:19:28Z)
Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning [6.537257913467247]
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD) Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios.
arXiv Detail & Related papers (2025-01-15T13:46:33Z)
Granting GPT-4 License and Opportunity: Enhancing Accuracy and Confidence Estimation for Few-Shot Event Detection [6.718542027371254]
Large Language Models (LLMs) have shown enough promise in few-shot learning context to suggest use in the generation of "silver" data. Confidence estimation is a documented weakness of models such as GPT-4. The present effort explores methods for effective confidence estimation with GPT-4 with few-shot learning for event detection in the BETTER License as a vehicle.
arXiv Detail & Related papers (2024-08-01T21:08:07Z)
Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models [27.675558033502565]
We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection. For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of. For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
arXiv Detail & Related papers (2024-07-12T03:33:13Z)
Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks. The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o. Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z)
Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models [20.92843974858305]
GPT models are increasingly being used for task optimization. In this paper, we introduce a straightforward yet potent Conversation Reconstruction Attack. We present two advanced attacks targeting improved reconstruction of past conversations.
arXiv Detail & Related papers (2024-02-05T13:18:42Z)
Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation [70.27452774899189]
Large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. As of November 2023, state-of-the-art LLMs do not provide access to these probabilities. Our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets.
arXiv Detail & Related papers (2023-11-15T11:27:44Z)
A negation detection assessment of GPTs: analysis with the xNot360 dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension. We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset. Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z)
GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs. It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour. Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.