Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review
and Meta-Analysis
- URL: http://arxiv.org/abs/2310.08410v1
- Date: Thu, 12 Oct 2023 15:26:26 GMT
- Title: Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review
and Meta-Analysis
- Authors: Qiuhong Wei, Zhengxiong Yao, Ying Cui, Bo Wei, Zhezhen Jin, and Ximing
Xu
- Abstract summary: Large language models such as ChatGPT are increasingly explored in medical domains.
This study aims to summarize the available evidence on evaluating ChatGPT's performance in medicine.
- Score: 7.587141771901865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models such as ChatGPT are increasingly explored in medical
domains. However, the absence of standard guidelines for performance evaluation
has led to methodological inconsistencies. This study aims to summarize the
available evidence on evaluating ChatGPT's performance in medicine and provide
direction for future research. We searched ten medical literature databases on
June 15, 2023, using the keyword "ChatGPT". A total of 3520 articles were
identified, of which 60 were reviewed and summarized in this paper and 17 were
included in the meta-analysis. The analysis showed that ChatGPT displayed an
overall integrated accuracy of 56% (95% CI: 51%-60%, I² = 87%) in addressing
medical queries. However, the studies varied in question resource,
question-asking process, and evaluation metrics. Moreover, many studies failed
to report methodological details, including the version of ChatGPT and whether
each question was used independently or repeatedly. Our findings revealed that
although ChatGPT demonstrated considerable potential for application in
healthcare, the heterogeneity of the studies and insufficient reporting may
affect the reliability of these results. Further well-designed studies with
comprehensive and transparent reporting are needed to evaluate ChatGPT's
performance in medicine.
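As context for the pooled statistics above, the sketch below shows how an inverse-variance pooled accuracy, Cochran's Q, and the I² heterogeneity statistic are typically computed. The per-study counts are hypothetical placeholders, not the 17 studies from this meta-analysis.

```python
import math

# Hypothetical per-study results (correct answers, total questions);
# illustrative placeholders, not the studies from this review.
studies = [(55, 100), (120, 200), (40, 90), (210, 350), (75, 120)]

# Per-study accuracy and its approximate variance p(1 - p) / n.
props = [(k / n, (k / n) * (1 - k / n) / n) for k, n in studies]

# Inverse-variance (fixed-effect) weights and pooled accuracy.
weights = [1.0 / v for _, v in props]
pooled = sum(w * p for w, (p, _) in zip(weights, props)) / sum(weights)

# Cochran's Q and I^2 = max(0, (Q - df) / Q) * 100%.
q = sum(w * (p - pooled) ** 2 for w, (p, _) in zip(weights, props))
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

# Approximate 95% CI for the pooled estimate.
se = math.sqrt(1.0 / sum(weights))
print(f"pooled = {pooled:.1%} "
      f"(95% CI {pooled - 1.96 * se:.1%}-{pooled + 1.96 * se:.1%}), "
      f"I^2 = {i2:.0f}%")
```

An I² as high as the 87% reported here indicates substantial between-study heterogeneity, which is why random-effects models, and the cautious interpretation the authors recommend, are usually warranted.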
Related papers
- Evaluating the quality of published medical research with ChatGPT [4.786998989166]
Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions.
Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine.
This article investigates this anomaly with the largest dataset yet and a more detailed analysis.
arXiv Detail & Related papers (2024-11-04T10:24:36Z)
- Enhancing Medical Support in the Arabic Language Through Personalized ChatGPT Assistance [1.174020933567308]
ChatGPT provides real-time, personalized medical diagnosis at no cost.
The study involved compiling a dataset of disease information and generating multiple messages for each disease.
ChatGPT's performance was assessed by measuring the similarity between its responses and the actual diseases.
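The abstract does not specify the similarity metric; as a minimal illustration of the general idea, the hypothetical sketch below scores a response against the reference disease name with a standard-library string-similarity ratio.

```python
from difflib import SequenceMatcher

def similarity(response: str, disease: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, response.lower(), disease.lower()).ratio()

# Hypothetical example: compare a model's answer to the true diagnosis.
print(similarity("Type 2 diabetes mellitus", "type 2 diabetes mellitus"))  # 1.0
print(similarity("Seasonal influenza", "type 2 diabetes mellitus"))        # low
```

Embedding-based cosine similarity is a common, more robust alternative when responses are full sentences rather than short disease names.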
arXiv Detail & Related papers (2024-03-21T21:28:07Z)
- AI Insights: A Case Study on Utilizing ChatGPT Intelligence for Research Paper Analysis [0.0]
The study selected "Application of Artificial Intelligence in Breast Cancer Treatment" as the research topic.
Research papers related to this topic were collected from three major publication databases: Google Scholar, PubMed, and Scopus.
ChatGPT models were used to identify the category, scope, and relevant information from the research papers.
arXiv Detail & Related papers (2024-03-05T19:47:57Z)
- De-identification of clinical free text using natural language processing: A systematic review of current approaches [48.343430343213896]
Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years.
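As a concrete illustration of what de-identification does, here is a toy rule-based baseline, one of several approach families such systems use; the patterns and note text are hypothetical.

```python
import re

# Minimal rule-based de-identification sketch: masks a few common PHI
# patterns (dates, US-style phone numbers, medical record numbers).
# Real systems use far richer rules and/or ML-based NER.
PATTERNS = {
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "[MRN]": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}

def deidentify(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

note = "Seen on 03/14/2022, MRN: 482913. Call 555-867-5309 to follow up."
print(deidentify(note))
# -> "Seen on [DATE], [MRN]. Call [PHONE] to follow up."
```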
arXiv Detail & Related papers (2023-11-28T13:20:41Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential [6.127537348178505]
ChatGPT can successfully translate radiology reports into plain language with an average score of 4.27 in the five-point system.
ChatGPT shows some randomness in its responses, occasionally over-simplifying or omitting information.
Results are compared with the newly released GPT-4, showing that GPT-4 can significantly improve the quality of the translated reports.
arXiv Detail & Related papers (2023-03-16T02:21:39Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
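The abstract gives no implementation details; a minimal sketch of the general idea, prompting a model for labeled synthetic examples via the openai Python client, with a hypothetical prompt and task (not the paper's actual pipeline), might look like:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt for a toy medication-mention labeling task.
prompt = (
    "Generate 5 short clinical sentences. After each sentence, add a "
    "tab and the label POSITIVE or NEGATIVE for a medication mention."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)

# Each line parses into a (text, label) pair for downstream training.
print(response.choices[0].message.content)
```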
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose, reliable NLG metric.
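Concretely, such a meta-evaluation reports the correlation between metric scores and human ratings over the same outputs; a minimal sketch with hypothetical scores:

```python
from scipy.stats import spearmanr

# Hypothetical per-output quality scores.
human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]        # human judgments
metric_scores = [0.82, 0.35, 0.60, 0.90, 0.20, 0.75]  # e.g. ChatGPT-as-judge

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```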
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective [67.98821225810204]
We evaluate the robustness of ChatGPT from the adversarial and out-of-distribution perspective.
Results show consistent advantages on most adversarial and OOD classification and translation tasks.
ChatGPT shows astounding performance in understanding dialogue-related texts.
arXiv Detail & Related papers (2023-02-22T11:01:20Z)
- ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports [0.4194454151396506]
ChatGPT is a language model capable of generating text that appears human-like and authentic.
We asked 15 radiologists to assess the quality of radiology reports simplified by ChatGPT.
Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient.
arXiv Detail & Related papers (2022-12-30T18:55:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.