ChatGPT Hallucinates when Attributing Answers
- URL: http://arxiv.org/abs/2309.09401v1
- Date: Sun, 17 Sep 2023 23:49:12 GMT
- Title: ChatGPT Hallucinates when Attributing Answers
- Authors: Guido Zuccon, Bevan Koopman, Razia Shaik
- Abstract summary: We investigate how different prompts impact answers and evidence.
We find that ChatGPT provides correct or partially correct answers in about half of the cases,
but its suggested references exist only 14% of the time.
- Score: 27.63520311803786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can ChatGPT provide evidence to support its answers? Does the evidence it
suggests actually exist and does it really support its answer? We investigate
these questions using a collection of domain-specific knowledge-based
questions, specifically prompting ChatGPT to provide both an answer and
supporting evidence in the form of references to external sources. We also
investigate how different prompts impact answers and evidence. We find that
ChatGPT provides correct or partially correct answers in about half of the
cases (50.6% of the time), but its suggested references exist only 14% of the
time. We further analyse the generated references, revealing common traits among
them, and show that even when a reference provided by the model does exist, it
often does not support the claims ChatGPT attributes to it. Our findings are
important because
(1) they are the first systematic analysis of the references created by ChatGPT
in its answers; and (2) they suggest that the model may leverage good-quality
information in producing correct answers, but is unable to attribute real
evidence to support its answers. Prompts, raw result files and manual analysis
are made publicly available.
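The verification pipeline described in the abstract can be illustrated with a minimal sketch: prompt the model for an answer together with supporting references, then try to resolve each suggested reference against a bibliographic index. The code below is only an illustrative approximation, assuming the OpenAI chat completions API and a Crossref title lookup; the paper's actual prompts and its manual verification procedure are in the publicly released materials.

```python
# Minimal sketch of an answer-plus-references query and a naive existence check.
# Assumptions (not from the paper): the OpenAI chat completions API is used to
# query the model, and Crossref's public search API is used to look up titles.
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_with_references(question: str) -> str:
    """Prompt the model for an answer plus supporting references."""
    prompt = (
        f"{question}\n\n"
        "Answer the question and list the references (title, authors, venue, "
        "year) that support your answer."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def reference_exists(title: str) -> bool:
    """Naive existence check: does a Crossref search return a matching title?"""
    r = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=10,
    )
    items = r.json().get("message", {}).get("items", [])
    if not items:
        return False
    found = " ".join(items[0].get("title", [""])).lower()
    return title.lower() in found or found in title.lower()
```

A fuzzy title match like this will miss genuine references that the model has reformatted, which is one reason the paper relies on manual analysis rather than an automated lookup.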
Related papers
- Evaluating ChatGPT as a Question Answering System: A Comprehensive
Analysis and Comparison with Existing Models [0.0]
This article scrutinizes ChatGPT as a Question Answering System (QAS).
The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs.
The evaluation highlights hallucinations, where ChatGPT provides responses to questions whose answers are not available in the provided context.
arXiv Detail & Related papers (2023-12-11T08:49:18Z) - Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z) - What has ChatGPT read? The origins of archaeological citations used by a
generative artificial intelligence application [0.0]
This paper tests which archaeological literature appears to have been included in ChatGPT's training data.
While ChatGPT offered seemingly pertinent references, a large percentage proved to be fictitious.
It can be shown that all references provided by ChatGPT that were found to be genuine have also been cited on Wikipedia pages.
arXiv Detail & Related papers (2023-08-07T05:06:35Z) - CORE-GPT: Combining Open Access research and large language models for
credible, trustworthy question answering [0.6537685198688536]
We present CORE-GPT, a novel question-answering platform that combines GPT-based language models with more than 32 million full-text open access scientific articles from CORE.
We first demonstrate that GPT-3.5 and GPT-4 cannot be relied upon to provide references or citations for generated text.
We then introduce CORE-GPT, which delivers evidence-based answers to questions, along with citations and links to the cited papers; a generic sketch of this retrieve-then-cite pattern follows the list.
arXiv Detail & Related papers (2023-07-06T13:41:36Z) - ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodic evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks, collected from March 2023 onwards.
arXiv Detail & Related papers (2023-04-27T11:33:48Z) - Why Does ChatGPT Fall Short in Providing Truthful Answers? [31.656442655938445]
We investigate ChatGPT's failures in providing truthful answers to user questions.
We identify two critical abilities associated with factuality: knowledge memorization and knowledge recall.
Our findings suggest that augmenting the model with granular external knowledge and cues for knowledge recall can enhance the model's factuality in answering questions.
arXiv Detail & Related papers (2023-04-20T17:48:43Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models [49.52083248451775]
Large language models (LLMs) have made significant progress in NLP.
We specifically focus on ChatGPT, a widely used and easily accessible LLM.
We conduct a series of experiments on 11 datasets to evaluate ChatGPT's commonsense abilities.
arXiv Detail & Related papers (2023-03-29T03:05:43Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the widely used GLUE benchmark, comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - REM-Net: Recursive Erasure Memory Network for Commonsense Evidence
Refinement [130.8875535449478]
REM-Net is equipped with a module that refines the evidence by erasing low-quality evidence that does not explain the answer.
Instead of retrieving evidence from existing knowledge bases, REM-Net leverages a pre-trained generative model to generate candidate evidence customized for the question.
The results demonstrate the effectiveness of REM-Net and show that the refined evidence is explainable.
arXiv Detail & Related papers (2020-12-24T10:07:32Z)
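The CORE-GPT entry above describes a retrieve-then-cite pattern: relevant open access articles are retrieved first, and the model then answers from those sources and cites them. The sketch below is a generic, much-simplified version of that pattern, not CORE-GPT's actual pipeline; it assumes a small local list of passages ranked with TF-IDF and the OpenAI chat completions API.

```python
# Generic retrieve-then-cite sketch (not CORE-GPT's pipeline): rank a local
# corpus of passages with TF-IDF, then ask the model to answer using only the
# retrieved passages and to cite them by index.
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_citations(question: str, passages: list[str], k: int = 3) -> str:
    # Rank passages by TF-IDF similarity to the question.
    vectorizer = TfidfVectorizer().fit(passages + [question])
    scores = cosine_similarity(
        vectorizer.transform([question]), vectorizer.transform(passages)
    )[0]
    top = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:k]

    # Number the retrieved passages so the model can cite them as [1], [2], ...
    context = "\n".join(f"[{n + 1}] {passages[i]}" for n, i in enumerate(top))
    prompt = (
        "Answer the question using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Grounding the answer in retrieved text and asking for numbered citations is what lets such a system attach links to real documents, in contrast to the free-form references examined in the main paper above.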
This list is automatically generated from the titles and abstracts of the papers on this site.