How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation,
and Detection
- URL: http://arxiv.org/abs/2301.07597v1
- Date: Wed, 18 Jan 2023 15:23:25 GMT
- Title: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation,
and Detection
- Authors: Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan
Ding, Jianwei Yue, Yupeng Wu
- Abstract summary: ChatGPT is able to respond effectively to a wide range of human questions.
People are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society.
In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT.
- Score: 8.107721810172112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The introduction of ChatGPT has garnered widespread attention in both
academic and industrial communities. ChatGPT is able to respond effectively to
a wide range of human questions, providing fluent and comprehensive answers
that significantly surpass previous public chatbots in terms of security and
usefulness. On one hand, people are curious about how ChatGPT is able to
achieve such strength and how far it is from human experts. On the other hand,
people are starting to worry about the potential negative impacts that large
language models (LLMs) like ChatGPT could have on society, such as fake news,
plagiarism, and social security issues. In this work, we collected tens of
thousands of comparison responses from both human experts and ChatGPT, with
questions ranging from open-domain, financial, medical, legal, and
psychological areas. We call the collected dataset the Human ChatGPT Comparison
Corpus (HC3). Based on the HC3 dataset, we study the characteristics of
ChatGPT's responses, the differences and gaps from human experts, and future
directions for LLMs. We conducted comprehensive human evaluations and
linguistic analyses of ChatGPT-generated content compared with that of humans,
where many interesting results are revealed. After that, we conduct extensive
experiments on how to effectively detect whether a certain text is generated by
ChatGPT or humans. We build three different detection systems, explore several
key factors that influence their effectiveness, and evaluate them in different
scenarios. The dataset, code, and models are all publicly available at
https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.
Related papers
- ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis [0.0]
ChatGPT can generate test cases that generally match or slightly surpass those created by human participants.
When ChatGPT cooperates with humans, it can cover considerably more test viewpoints than each can achieve alone.
arXiv Detail & Related papers (2024-01-25T03:42:17Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets.
prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z) - Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency of selecting the labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z) - Can ChatGPT Reproduce Human-Generated Labels? A Study of Social
Computing Tasks [9.740764281808588]
ChatGPT has the potential to reproduce human-generated label annotations in social computing tasks.
We relabel five datasets covering stance detection (2x), sentiment analysis, hate speech, and bot detection.
Our results highlight that ChatGPT does have the potential to handle these data annotation tasks, although a number of challenges remain.
arXiv Detail & Related papers (2023-04-20T08:08:12Z) - ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking
about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and obtain Natural Language Inference(NLI) labels.
The study identified instances where ChatGPT provided incorrect answers to questions, providing insights into areas where the model may be prone to error.
arXiv Detail & Related papers (2023-04-06T18:42:47Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Let's have a chat! A Conversation with ChatGPT: Technology,
Applications, and Limitations [0.0]
Chat Generative Pre-trained Transformer, better known as ChatGPT, can generate human-like sentences and write coherent essays.
Potential applications of ChatGPT in various domains, including healthcare, education, and research, are highlighted.
Despite promising results, there are several privacy and ethical concerns surrounding ChatGPT.
arXiv Detail & Related papers (2023-02-27T14:26:29Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.