In ChatGPT We Trust? Measuring and Characterizing the Reliability of
ChatGPT
- URL: http://arxiv.org/abs/2304.08979v2
- Date: Thu, 5 Oct 2023 13:27:12 GMT
- Title: In ChatGPT We Trust? Measuring and Characterizing the Reliability of
ChatGPT
- Authors: Xinyue Shen and Zeyuan Chen and Michael Backes and Yang Zhang
- Abstract summary: ChatGPT has attracted more than 100 million users within a short period of time.
We perform the first large-scale measurement of ChatGPT's reliability in the generic QA scenario.
We find that ChatGPT's reliability varies across different domains, especially underperforming in law and science questions.
- Score: 44.51625917839939
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The way users acquire information is undergoing a paradigm shift with the
advent of ChatGPT. Unlike conventional search engines, ChatGPT retrieves
knowledge from the model itself and generates answers for users. ChatGPT's
impressive question-answering (QA) capability has attracted more than 100
million users within a short period of time but has also raised concerns
regarding its reliability. In this paper, we perform the first large-scale
measurement of ChatGPT's reliability in the generic QA scenario with a
carefully curated set of 5,695 questions across ten datasets and eight domains.
We find that ChatGPT's reliability varies across different domains, especially
underperforming in law and science questions. We also demonstrate that system
roles, originally designed by OpenAI to allow users to steer ChatGPT's
behavior, can impact ChatGPT's reliability in an imperceptible way. We further
show that ChatGPT is vulnerable to adversarial examples, and even a single
character change can negatively affect its reliability in certain cases. We
believe that our study provides valuable insights into ChatGPT's reliability
and underscores the need for strengthening the reliability and security of
large language models (LLMs).
Related papers
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - When ChatGPT Meets Smart Contract Vulnerability Detection: How Far Are We? [34.61179425241671]
We present an empirical study to investigate the performance of ChatGPT in identifying smart contract vulnerabilities.
ChatGPT achieves a high recall rate, but its precision in pinpointing smart contract vulnerabilities is limited.
Our research provides insights into the strengths and weaknesses of employing large language models, specifically ChatGPT, for the detection of smart contract vulnerabilities.
arXiv Detail & Related papers (2023-09-11T15:02:44Z) - ChatGPT is a Remarkable Tool -- For Experts [9.46644539427004]
We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style.
We highlight the potential risks associated with excessive reliance on ChatGPT in these fields.
We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited.
arXiv Detail & Related papers (2023-06-02T06:28:21Z) - ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired an infrastructural status.
Existing benchmarks encounter two challenges: (1) Disregard for periodical evaluation and (2) Lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March, 2023 to now.
arXiv Detail & Related papers (2023-04-27T11:33:48Z) - Evaluating ChatGPT's Information Extraction Capabilities: An Assessment
of Performance, Explainability, Calibration, and Faithfulness [18.945934162722466]
We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks.
ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting.
ChatGPT provides high-quality and trustworthy explanations for its decisions.
arXiv Detail & Related papers (2023-04-23T12:33:18Z) - Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.