Evaluating the quality of published medical research with ChatGPT
- URL: http://arxiv.org/abs/2411.01952v1
- Date: Mon, 04 Nov 2024 10:24:36 GMT
- Title: Evaluating the quality of published medical research with ChatGPT
- Authors: Mike Thelwall, Xiaorui Jiang, Peter A. Bath
- Abstract summary: Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions.
Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine.
This article investigates this anomaly with the largest dataset yet and a more detailed analysis.
- Score: 4.786998989166
- License:
- Abstract: Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions. Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine. This article investigates this anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK's Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226 (due to the departmental averaging involved). At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, their mean ChatGPT score correlated strongly with their REF score (r=0.495) but negatively with their citation rate (r=-0.148). Journal and departmental anomalies in these results point to ChatGPT being ineffective at assessing the quality of research in prestigious medical journals or research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT's ability to assess research quality overall for Clinical Medicine, so now there is evidence of its ability in all academic fields.
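The abstract reports an article-level correlation (r=0.134) against a ceiling of r=0.226 that arises because each article is compared with its department's mean REF score rather than its own score. The sketch below illustrates why such a ceiling exists, using synthetic data only: the department count of 31 comes from the abstract, but the 300 articles per department, the score distribution, and the names `pearson` and `ceiling_correlation` are assumptions made for illustration, not the authors' method or data. Even a scorer that reproduced every article's true score exactly would correlate with the departmental means at well below 1.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ceiling_correlation(dept_scores):
    """Correlate each article's own score with its department's mean score.

    This is the best correlation a perfect article-level scorer could
    reach when only departmental means are available as the benchmark.
    """
    article_scores, dept_means = [], []
    for scores in dept_scores.values():
        m = mean(scores)
        article_scores.extend(scores)
        dept_means.extend([m] * len(scores))
    return pearson(article_scores, dept_means)


# Synthetic example (hypothetical numbers, not REF data):
# 31 departments, 300 articles each, scores of 2, 3 or 4 stars,
# with each department having its own share of 4-star articles.
random.seed(0)
depts = {}
for d in range(31):
    share_4 = random.uniform(0.3, 0.6)
    depts[d] = random.choices(
        [4, 3, 2], weights=[share_4, 0.9 - share_4, 0.1], k=300
    )

ceiling = ceiling_correlation(depts)
print(f"ceiling for a perfect scorer on synthetic data: {ceiling:.3f}")

# Ratio of the observed correlation to the ceiling, using the two
# values reported in the abstract.
observed, reported_ceiling = 0.134, 0.226
print(f"observed / ceiling = {observed / reported_ceiling:.2f}")
```

On the reported figures, the observed correlation is roughly 0.59 of the stated ceiling. This ratio is a plain division of the two published values, offered as a reading aid rather than a statistic from the paper.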
Related papers
- Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms [3.3543455244780223]
This paper introduces two new contexts and employs a more robust method - averaging multiple ChatGPT scores.
The findings show that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only submitted titles and abstracts, failed to predict peer review outcomes for F1000Research.
arXiv Detail & Related papers (2024-11-14T19:20:33Z)
- Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations [3.946288852327085]
This study investigates whether ChatGPT can evaluate societal impact claims.
It compares the results with published departmental average ICS scores.
The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment.
arXiv Detail & Related papers (2024-10-25T19:51:10Z)
- Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z)
- Can ChatGPT evaluate research quality? [3.9627148816681284]
ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match REF criteria.
Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks.
arXiv Detail & Related papers (2024-02-08T10:00:40Z)
- A Federated Learning Framework for Stenosis Detection [70.27581181445329]
This study explores the use of Federated Learning (FL) for stenosis detection in coronary angiography images (CA).
Two heterogeneous datasets from two institutions were considered: dataset 1 includes 1219 images from 200 patients, which we acquired at the Ospedale Riuniti of Ancona (Italy);
dataset 2 includes 7492 sequential images from 90 patients from a previous study available in the literature.
arXiv Detail & Related papers (2023-10-30T11:13:40Z)
- Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review and Meta-Analysis [7.587141771901865]
Large language models such as ChatGPT are increasingly explored in medical domains.
This study aims to summarize the available evidence on evaluating ChatGPT's performance in medicine.
arXiv Detail & Related papers (2023-10-12T15:26:26Z)
- Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective [67.98821225810204]
We evaluate the robustness of ChatGPT from the adversarial and out-of-distribution perspective.
Results show consistent advantages on most adversarial and OOD classification and translation tasks.
ChatGPT shows astounding performance in understanding dialogue-related texts.
arXiv Detail & Related papers (2023-02-22T11:01:20Z)
- ITTC @ TREC 2021 Clinical Trials Track [54.141379782822206]
The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes.
We explore different ways of representing trials and topics using NLP techniques, and then use a common retrieval model to generate the ranked list of relevant trials for each topic.
The results from all our submitted runs are well above the median scores for all topics, but there is still plenty of scope for improvement.
arXiv Detail & Related papers (2022-02-16T04:56:47Z)
- Ranking Scientific Papers Using Preference Learning [48.78161994501516]
We cast it as a paper ranking problem based on peer review texts and reviewer scores.
We introduce a novel, multi-faceted generic evaluation framework for making final decisions based on peer reviews.
arXiv Detail & Related papers (2021-09-02T19:41:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.