Can ChatGPT evaluate research quality?
- URL: http://arxiv.org/abs/2402.05519v1
- Date: Thu, 8 Feb 2024 10:00:40 GMT
- Title: Can ChatGPT evaluate research quality?
- Authors: Mike Thelwall
- Abstract summary: ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match REF criteria.
Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks.
- Score: 3.9627148816681284
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Purpose: Assess whether ChatGPT 4.0 is accurate enough to perform research
evaluations on journal articles to automate this time-consuming task.
Design/methodology/approach: Test the extent to which ChatGPT-4 can assess the
quality of journal articles using a case study of the published scoring
guidelines of the UK Research Excellence Framework (REF) 2021 to create a
research evaluation ChatGPT. This was applied to 51 of my own articles and
compared against my own quality judgements. Findings: ChatGPT-4 can produce
plausible document summaries and quality evaluation rationales that match the
REF criteria. Its overall scores have weak correlations with my self-evaluation
scores of the same documents (averaging r=0.281 over 15 iterations, with 8
being statistically significantly different from 0). In contrast, the average
scores from the 15 iterations produced a statistically significant positive
correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds
seems more effective than individual scores. The positive correlation may be
due to ChatGPT being able to extract the author's significance, rigour, and
originality claims from inside each paper. If my weakest articles are removed,
then the correlation with average scores (r=0.200) falls below statistical
significance, suggesting that ChatGPT struggles to make fine-grained
evaluations. Research limitations: The data is self-evaluations of a
convenience sample of articles from one academic in one field. Practical
implications: Overall, ChatGPT does not yet seem to be accurate enough to be
trusted for any formal or informal research quality evaluation tasks. Research
evaluators, including journal editors, should therefore take steps to control
its use. Originality/value: This is the first published attempt at
post-publication expert review accuracy testing for ChatGPT.
Related papers
- Evaluating the Predictive Capacity of ChatGPT for Academic Peer Review Outcomes Across Multiple Platforms [3.3543455244780223]
This paper introduces two new contexts and employs a more robust method - averaging multiple ChatGPT scores.
Findings that averaging 30 ChatGPT predictions, based on reviewer guidelines and using only submitted titles and abstracts, failed to predict peer review outcomes for F1000Research.
arXiv Detail & Related papers (2024-11-14T19:20:33Z) - Evaluating the quality of published medical research with ChatGPT [4.786998989166]
evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions.
Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine.
This article investigates this anomaly with the largest dataset yet and a more detailed analysis.
arXiv Detail & Related papers (2024-11-04T10:24:36Z) - Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations [3.946288852327085]
This study investigates whether ChatGPT can evaluate societal impact claims.
It compares the results with published departmental average ICS scores.
The scores generated by this approach correlated positively with departmental average scores in all 34 Units of Assessment.
arXiv Detail & Related papers (2024-10-25T19:51:10Z) - Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT's Effectiveness with Different Settings and Inputs [3.9627148816681284]
This article assesses which ChatGPT inputs produce better quality score estimates.
The optimal input is the article title and abstract, with average ChatGPT scores based on these correlating at 0.67 with human scores.
arXiv Detail & Related papers (2024-08-13T09:19:21Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z) - Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There are no standard procedure for using this ranking information and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z) - Ranking Scientific Papers Using Preference Learning [48.78161994501516]
We cast it as a paper ranking problem based on peer review texts and reviewer scores.
We introduce a novel, multi-faceted generic evaluation framework for making final decisions based on peer reviews.
arXiv Detail & Related papers (2021-09-02T19:41:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.